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Chapter  1 


Executive  Summary 


Tliis  is  the  liiiul  icchnieal  roporl  I'ur  (he  iirojcet  Real-'!‘iiiic,  Iiilcractive  Spoken  l.uii- 

guago  Sysicnis,  sponsored  by  the  Advanced  Rcscarcli  Projects  Agency  (ARPA)  and  mon¬ 
itored  by  ONR  undc"  contract  No,  N00014-92-C-0{)35  (BBN  Reference  Number  11617) 
during  the  period  16  ‘kpril  1992  to  30  June  1994. 

The  objective  of  this  project  was  to  make  the  next  signilicant  advance  in  hnman-machine 
interaction  by  dcvelopintt  a  spoken  language  system  (SLS)  that  operates  in  real-time  while 
maintaining  high  accuracy  on  cost-ciTecli'  c  COTS  (commercial,  off-the-shelf)  hardware. 
The  system  has  a  highly  interactive  user  inlcrfacc,  is  largely  user  independent  and  to  be 
easily  portable  to  new  applications.  The  BBN  HARC  spoken  Language  system  consists 
of  the  Byblos  speech  recognition  system  and  the  Delphi  or  HUM  language  understanding 
syEtem. 

Our  re.search  has  concentrated  on  the  development  of  an  effective  SLS  system  (that  is, 
the  i''?cgration  between  speech  and  language  processing),  advances  in  language  processing 
to  move  away  from  a  pipeline,  syntax-first  architecture  to  one  in  which  syntax  and  semantics 
play  con.plementary  roles,  processing  ill-formed  input  in  a  principled,  domain-independent 
way,  and  developing  a  complcicly  novel  approach  to  language  understanding. 

The  BYBLOS  s|>eech  recognition  system  uses  a  novel  four-pass  search  strategy.  It 
produces  ordered  lists  of  the  N  top-scoring  hypothc.ses  (N-best)  which  arc  then  reordered 
by  several  detailed  knuwledgc  sources.  The  N-best  strategy  [48,  3,  26,  63,  64,  65)  permits 
the  use  of  computationally  prohibitive  models  by  greatly  reducing  tlie  search  space  to  a  few 
dozen  word  sequences.  It  has  enabled  us  to  use  cross-word-boundary  triphone  models  and 
trigram  language  models  with  ease.  The  N-best  list  is  also  a  robust  interface  between  speech 
and  natural  language  that  provides  a  way  to  lecover  from  s[reech  errors  in  the  top  choice 


8 


BlSi\  Swsicins  and  'rm  linoln^ies 


WDIcl  sCt|llCIK'0. 

The  Delphi  hinpuage  undersianding  sysicni  use--  a  deliiiiie  clause  grainniar  lormalism, 
augmeiHed  by  ihe  use  of  consiraini  nodes  1691  and  a  labelled  argument  rormalisi  i.  The 
parsing  algorithm  uses  a  statistically  trained  agenda  to  produce  the  single  best  parse  for  an 
input  utterance  120).  Wc  have  added  a  robust  I'ragmci.t  parser  to  deal  with  speech  errors 
'when  the  correct  answer  is  not  in  the  N-best  list  123|.  We  developed  a  new  hybrid  method 
of  represeniallon  that  combines  the  best  features  of  logical  and  frame  representalioi'.^.  One 
of  the  most  important  features  of  Delphi  is  the  Semantic  Linker,  which  is  able  to  handle 
ill-formed  input  by  linking  partially-understood  fracnents  on  the  basis  of  meaning,  without 
reiiuiring  a  full  syntactic  tinalysis. 

A  novel,  potentially  high-|)ayoff,  approach  to  language  understanding  was  initiated  under 
this  contract.  The  approach,  called  IIDM  (Hidden  l'nderstaai,ling  Model)  incorporates  a 
statistical  model  of  meaning,  If  successful,  lll'M  will  leatl  lo  automatic  acquisition  of 
linguistic  knowledge,  coupled  with  high  irciTormance.  An  initial  version  of  HUM  w'as 
implemented  and  tested  in  ihe  Airline  Travel  Information  Service  (A'l'lS)  domain.  The  test 
showed  the  basic  soundness  of  this  novel  approach  l-i6,  43), 

In  the  area  of  portability,  we  c.xamined  several  new  tasks  as  possible  targets  for  porting 
the  spoken  language  technology,  including  shared-map  planning,  multi-media  conferencing, 
and  other  database  applications.  Wc  chose  SPLINT,  a  database  of  information  about  Air 
Force  ba.ses  and  equipmcni,  and  ported  the  HARC  SLS  to  that  database  in  conjunction  with 
another  contract  |6,  I4|. 

During  the  period  of  this  contract,  altliough  not  as  part  of  it,  I3BN  released  the  first  version 
of  the  HARK™spccch  recognition  system  as  a  commercia  proc'  ict.  HARK  is  based  in  part 
on  BYBLOS  speech  recognition  technology.  HARK  is  the  lirsl  product  of  its  kind  to  run 
in  real-time  with  a  large  vocabulary  (over  2000  words)  on  off-the-shelf,  audio-capable  Unix 
workstations.  A  companion  product,  the  HARK  Prototyper™,  allows  the  HARK  vocabulary 
and  grammar  to  be  configured  by  application  programmers. 

Major  accomplishments  achieved  during  this  project  include; 

1.  Designed  and  implemented  an  initial  version  of  HUM  (hidden  understanding 
model),  a  new  method  for  language  understanding,  based  on  learning  a  statistical 
model  of  semantics  from  annotated  data.  The  system  minimizes  the  labor-intensive 
writing  of  semantic  rules  and  automates  system  training  from  data,  resulting  m  a 
greater  level  of  domain  independence. 

2.  Began  the  development  of  tools  and  processes  for  porting  a  spoken  language 
system  to  a  new  domain.  The  tools  were  tested  by  jxirforming  an  actual  port,  under  a 


IlHS  .S'\s7(V):v  iiiul  'ri‘clini>li>vii’.\ 
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Riiiiic  L.ih  coim.iL'i,  10  .1  (.l.iialvi'-c  coiii.nnliij  iiiform.itioii  .ilioLil  Air  l  oicc  bases  and 
I'qUIliIlK’nl. 

.V  I’roposcd  concrcic  sie-ps  lowaid  daliiiiiig  the  Semlival  cvalualion  niclhodology  -  a 
doinain-indciKndciu  niclhodology  for  die  ei'aluaiion  of  sciiiaiitics  capabilities  -  and 
developed  aiinolalion  tools  to  facilitate  explorations  of  SciiiEval. 

4,  Extended  the  AXIS  speech  understanding  system  to  the  ATIS3  database, 
increasing  the  vocabulary  to  alxn.'.  3,0'K)  words,  and  participated  in  ilic  .iniuial  AREA 
ei alualiuns.  Tliere  was  a  signilieant  decrease  in  error  late  over  the  presious  year  and 
the  BYBI.O.S  system  again  havl  the  highest  speech  reeogiiiiion  accuracy  in  the 
es  all!  ation. 

Our  research  into  tlie  appropriate  iiitcgiatiun  of  lexical,  syntactic,  and  scnianiic 
know  ledge  produced  a  systcni  that  is  capable  of  using  elTieiently  w  hatever  types  of 
knowledge  will  produce  a  valid  iiilerpreiation  of  an  utterance  in  context.  Syntactic 
knowledge  is  used  if  it  is  available  and  reliable,  but  an  utierance  that  is  outside  the 
scope  of  DF.LPLlH's  gramtiiar  can  still  be  understood  if  phrases  can  be  recogni/'ctl 
and  cotiibitted  semantically. 

(i.  We  developed  a  learnable  tnudel  of  setnatitics,  to  facilitate  the  acquisition  of 
doniain-spccilic  ini'orniation  that  was  formerly  very  labor  imciisive  to  jiroducc. 

7.  At  the  1992  AREA  Speech  and  Natural  Language  Workshop,  BBN  gave  a 
dcnionstration  of  the  lirst  HX)0-wotd,  real-time,  continuous,  speaker  independent 
speech  reeognitioti  systetii  implemented  on  an  off-the-shelf  workstation,  without  any 
accelerator  hoards. 

8.  BBN  also  gave  the  lirst  real-time  dctiioiistration  of  a  complete  spoken  language 
utidcrstanding  system  in  the  AXIS  (Air  Xtavcl  Itiformatioii  System)  domain, 

10.  We  participated  fully  in  the  AXIS  data  collection  effort,  contributing  a  large 
portion  of  the  AXIS  training  and  cvalualion  data. 

1 1.  A  demo  suite,  which  includes  real-time  speech  recognition  and  understanding 
demonstrations,  was  delivered  to  NSA  and  NISX.  Xhc  demos  run  on  the  Silicon 
Graphics  Indigo.  Xhc  real-time  AXIS  system  has  been  used  by  NISX  to  collect  AXIS 
data  from  subjects. 

Drs,  Madeleine  Bates  and  John  Makhoul  were  invited  to  participate  in  the  National 
Academy  of  Sciences  Colloquium  on  Human-Machine  Communication  by  Voice,  Irvine, 
California,  February  8-9,  1993.  Xhey  gave  the  following  presentations:  Madeleine  Bates," 
Models  of  Natural  Language  Understanding.”,  John  Makhoul,  “State  of  the  Art  in 
Contimioub  Speech  Recognition.”,  Xhesc  two  papers  appeared  in  a  book  published  by  the 
National  Academy  of  Sciences! 7,  451. 


BB\'  Syslenis  and  Ti'clnuila^it's  I  I 

W'e  Ikui.’  .iiul  woikcil  closcl>  \v.iih  'l’nci'.iI  1'  XRI’A  ukIo  SLS  program  coniniiilccs, 

111  pariKiilar  ilio  Loniiiiiiiee  iliai  doiermmcs  ilic  nii’ii  uJology  lor  the  ATIS  systems  (the 
MADC'OW  eommiiieel.  Some  of  our  eonti ibulioiis  to  ilic  evaluation  mctliodology  arc 
doeunicnted  in  124,  731. 

Dr.  Madeleine  Rates  was  Chair  of  the  19‘)3  ARI  .'  Human  Language  Technology 
■Worksliop,  which  took  place  on  March  21-24,  1993,  at  tltc  Merrill  Lynch  Conrercncc 
Ccnier.  Plainsboro,  NJ.  (I  l|  BBN  presented  demoii'-i rations  ol  the  BBN  real  time  ATIS 
sysieni  at  this  workshop,  and  ihe  BBN  real  time  A'l  IS  system  wv.s  available  in  the  demo 
room  throughout  the  workshop.  She  also  lo-ehaircd  the  Applied  Natural  Language 
Processing  Conrereiice  in  lienio,  Itaiy  in  .April,  |9'i2. 

We  partici'pated  in  the  annual  ARP.A  Siieech  and  M.  workshops,  and  the  Human  Language 
Technology  workshops,  by  presenting  papers  .uid  going  demonstrations  of  Ihe  technology 
developed  under  this  elTori.  Kererencvs  •  the  papers  from  these  workshops,  as  well  as 
other  presentations  and  papers  produced  under  this  contract,  can  be  found  in  the 
bibliography. 

In  Chapter  2  of  this  document,  we  present  a  general  overview  of  the  BBN  MARC  spoken 
language  understanding  system.  Chapter  3  dcserilK’s  the  Delphi  tuitural  language 
compouent  of  HARC,  and  Chapter  4  details  the  Semantic  Linker  component  of  Delphi. 
Some  of  the  speech  research  carried  out  under  this  contract  is  discus.sed  in  Chapter  5,  and 
our  most  recent  research  in  a  novel  method  of  natural  language  processing  is  pre.seiucd  in 
Chapter  6. 


Chapter  2 


Overview  of  HARC 


2.1  Introduction 


In  this  chapter  wc  tloscribo  the  i.iesign  and  perforniance  of  a  complete  spoken  langtiage 
understanding  system  currently  under  development  at  oiSN.  The  system,  dubbed  HARC 
(Hear  And  Respond  to  Continuous  speech),  successfully  integrates  state-of-the-art  speech 
recognition  and  natural  language  understanding  subsystems.  The  system  has  been  tested 
extensively  on  a  restricted  airline  travel  information  (ATIS)  domain  with  a  vocabulary  of 
over  1000  words.  In  this  application,  the  system  functions  as  an  electronic  airline  guide, 
searching  a  database  to  answer  questions  posed  by  the  user. 

HARC  is  implemented  in  portable,  high-level  software  that  runs  in  real  time  on  today’s 
workstations  to  support  interactive  online  human-machine  dialogs  at  a  very  comfortable 
pace.  No  special  purpose  hardware  is  required  other  than  an  A/D  converter  to  digitize  the 
speech. 

The  system  works  well  for  any  native  speaker  of  American  English  and  does  not  require 
any  enrollment  data  from  the  users.  HARC  has  shown  consistently  high  performance  in 
formal  evaluations  on  the  ATIS  domain. 

The  BBN  HARC  spoken  language  system  weds  two  technologies,  speech  recognition  and 
natural  language  understanding,  into  a  deployable  human-machine  interface.  The  problem 
of  understanding  goal-directed  spontaneous  speech  is  harder  than  recognizing  and 
understanding  read  text,  due  to  greater  variety  in  the  speech  and  language  produced.  We 
have  made  minor  imdifications  to  our  speech  recognition  and  understanding  methods  to 
deal  with  these  variabilities.  The  speech  recognition  uses  a  novel  multipass  .search  strategy 
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ihiii  .illows  l’ixmi  lloxibililv  and  cUicioncy  in  iho  api'lication  of  powerful  knowledge 
^ou^ee.^  The  naluial  language  sysiein  is  based  on  lunnal  linguislic  principles  with 
extensions  to  deal  with  speech  errors  and  to  make  it  robust  to  natural  variations  in 
language.  The  result  is  a  very  usable  system  for  domains  of  moderate  complexity. 

'While  the  techniques  used  here  arc  general,  the  most  complete  test  of  the  whole  system 
•thus  far  was  made  using  the  ATIS  corpus,  which  is  briefly  described  in  Section  2.  Section 

3  describes  the  techniques  used  and  the  results  obtained  for  speech  recognition,  and  Section 

4  is  devoted  to  natural  language.  The  methods  for  combining  speech  recognition  and 
langtiage  understanding,  along  with  result'-  for  the  combined  system  are  given  in  Section  5. 
Finally,  in  Section  6,  we  describe  a  real-time  implementation  of  the  system  that  rtins 
entirely  in  software  on  a  single  workstation. 

More  details  on  the  specilic  techniques  u.sed,  the  makeup  of  the  coipus,  and  the  results  can 
be  found  in  the  papers  presented  at  the  ARI’.A  Workshops  on  Speech  and  Natural  Language 
and  other  meetings  jdl,  2.3,  21, 44,  3.'',  40,  61,  46,  70.  40.  .SO.  62,  49,  II,  48], 


2.2  The  ATIS  Domain  and  Corpus 

2.2.1  ATIS  Domain 

The  Air  Travel  Information  Service  (ATiS)  is  a  system  for  getting  information  about 
flights.  The  inforntation  contained  in  the  database  is  similar  to  that  found  in  the  Official 
Airline  Guide  (OAG)  but  is  for  a  small  number  of  cities.  The  ATIS  corpus  consists  of 
spoken  queries  by  a  large  number  of  users  who  were  trying  to  solve  travel  related 
problems.  The  ATISO  and  ATIS  1  corpora  contain  about  4,000  utlerances,  most  of  wliich 
were  read  sentences,  mostly  by  speakers  with  dialects  from  the  southern  U.S.  The  ATIS2 
training  corpus  consists  of  12,214  spontaneous  utterances  from  349  subjects  (1.59  female, 

1 50  male)  who  were  using  simulated  or  real  speech  understanding  systems  in  order  to 
obtain  realistic  speech  and  language.  The  data  originated  from  5  collection  sites  using  a 
variety  of  strategies  for  eliciting  and  capturing  spontaneous  queries  from  the  subjects  [44], 
with  a  disproportionate  amount  of  the  data  coming  from  MIT. 

1,289  utterances  were  truncated  or  contained  word  fragments  due  to  stuttering.  Many  more 
contained  various  nonspcech  sounds.  There  were  .also  frequent  long  pauses  and  hesitations 
in  the  data.  Each  sentence  in  the  corpus  was  classified  as  class  A  (self  contained  meaning), 
class  D  (referring  to  some  previous  sentence),  or  class  X  (impossible  to  answer  for  a 
variety  of  reasons).  The  speech  recognition  system  was  tested  on  all  three  classes,  although 
the  results  for  classes  A  and  D  were  given  more  importance.  The  natural  language  system 
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,iml  ti  inl'iiKvl  iiiulfiM.iinliitf;  uci.  oicil  im-.K  .mi  il.isst".  A  .iiul  D. 

iIu-\  ul'K'  pn.‘^^.'liU■ll  Willi  .ill  ol  iln‘  k'sl  loiicc'.  iii  ilk’ii  (iiijjiii.il  niiloi. 

2.2.2  Formal  Evaluation  Conditions 

Tlic  November  '92  cvaluiuioa  tesi  sel  lias  data  I'rom  .45  speakers.  The  luiinber  ol'  uiienuices 
per  speaker  varied  I’rom  2  io4l.  but  ihe  number  ol  iilieranees  from  eaeh  of  llie  5 
dala-eolleeiion  sites  ivas  carefully  balanced  All  iv  ulis  given  were  coliccled  sviih  the 
Sennheiser  microphone  isunie  as  the  training  data,  fhe  rccogniiioii  mode  was 
speuker-inde|)eiideiii  -  the  lest  speakers  were  noi  m  ilio  training  set  ,iiid  every  sciiieiice  was 
treated  independemly. 

By  conimitlce  deeisioii  there  was  no  comiiion  baselme  conirol  coiulitioii  for  ihe  iraiiiing 
data  or  speech  grammar  to  be  used  for  the  ATIS  tests.  The  only  constraini  was  that  the 
single  common  test  set  iiuisi  be  used. 


2.3  BYBLOS  -  Speech  Recognition 

BYBLOS  is  a  statc-of-the-ail,  phoiictically-ba.scd,  continuous  speech  recognition  system 
that  has  been  under  development  at  BBN  for  over  seven  years.  This  system  introduced  an 
effective  .strategy  for  using  context-dependent  phonetic  hidden  Markov  models  (HMM)  and 
demonstiated  their  feasibility  for  large  vocabulary,  continuous  speech  applications  125). 
Over  the_  years,  the  core  algorithms  have  been  reftned  primarily  on  artificial  applications 
using  lead  speech  for  training  and  testing.  These  same  basic  algorithms,  with  small 
extensions,  have  proven  to  be  remarkably  suited  to  the  recognition  of  completely 
spontaneous  speech  produced  in  a  goal-directed  task,  such  as  ATIS. 

2.3.1  New  Extensions  for  Spontaneous  Speech 

Spontaneous  queries  sjxiken  in  a  problem-solving  dialog  exhibit  a  wide  variety  of 
disfluencies.  There  were  three  very  frequent  effects  that  we  attenipied  to  solve  - 
excessively  long  segments  of  waveform  w’.ih  no  speech,  poorly  transcribed  training 
utterances,  and  a  variety  of  nonspeeeh  sounds  produced  by  the  user. 

When  background  noise  is  pre.sent,  the  HMM  is  not  a  particularly  reliable  discriminalor  of 
speech  vs.  silence,  and  many  insertion  errors  result.  We  chose  to  find  and  truncate  long 


1# 

p’ .  ’*•  . 

'  S  ' '  . 

•K 

■  • 

V  '  . . 

■  ■  ., 

chm'iik:  < )vi  kvii.w  ()i  ii\Kf 

•• 

h' 

icvi'iii^  I'l  ib'iKiHwh  uiih  ,1  \i'u  .1  N|vi.'ili  ilL'k'LiiM  ili.ii  I'.iii  liLMl  «ilh 

■rk 

lUUsC  hui>l''  I1C«II  lilC  '•pCCclK  1  ik'  N|k\*.h  «I0ICCU»I  N  sC\Ol.n  '‘illlplc  ikiiiptUC 

SN'R-dcivtidcn;  ilciirLimn  thicslmlils 

Typii:all\,  iIicil'  .iic  inan\  uiuiansfiilvil  slum  scpiiKnis  ol  haikgiouikl  silcnac  icmamiiig  iii 

V 

the  wavL'Ioriib  aflor  trunsaling  the  long  ones.  \Vs  lniinu  tliL'in  to  lu'  so  imnicrous  tliai  they 
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int'asurably  ilegradad  ihc  ivrloiinaiicc  gam  usually  lorivcd  Inmi  using 

cross-\soiiJ-hoLiiulai>  iripluinc  IlMMs  We  deNiscil  i  pr<KViluiv  lo  .lulomaiieally  mark  Ihc 
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events,  such  as  pause  lillers  tum's  and  ult'si.  iliio.ii  .  le.inngs,  coughs,  laughier,  ami  heavy 
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breath  iiuise.  We  attempted  to  model  a  dozen  hio.id  classes  ol  nonspeeeh  sounds  that  were 
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both  ptoininent  and  numerous  llowexer,  when  we  .diowed  the  dceoiler  to  lind  nonspeeeh 
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models  between  words,  we  loumi  that  theie  were  nu'ie  false  detections  than  eorreet  ones. 

Beeau.se  our  silence  model  had  liitie  dillieulty  dealing  with  hreuth  noises,  lip  smacks,  and 
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other  noises,  our  Ixist  results  weie  achieved  hy  making  the  nonspeeeh  models  very  unlikely 

in  the  grammar. 
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2.3.2  Forward-Backward  N-best  Search  Strategy 

Iv' - 

The  BYBLOS  speech  recognition  system  uses  a  novel  multi-pass  seareh  strategy  designed 

to  use  progressively  more  detailed  models  on  a  correspondingly  reduced  search  space.  It 

produces  an  ordered  list  of  Ihc  N  top-scoring  hypotheses  which  is  then  reordered  by 

r't  '  ■ 

• 

several  detailed  knowledge  sources.  This  N-best  strategy  [26.  63]  permits  the  use  of 
otherwise  computationally  prohibitive  models  by  greatly  reducing  tlie  search  space  lo  a  few 
(N=20-100)  word  sequences.  It  has  enabled  us  to  u.sc  cross-word-boundary  ir .phone 

models  and  trigram  language  models  with  ease.  Tiie  N-besl  list  is  also  a  robust  interface 

between  speech  and  natural  language  that  provides  a  way  to  recover  from  siieech  errors. 

?•;■■■: 

We  use  a  4-pass  approach  to  produce  the  N-bcsl  lists  for  natural  language  processing. 

TV:‘ 

1.  A  forward  pass  with  a  bigram  grammar  and  discrete  HMM  models  saves  the  top 

word-ending  scores  and  times  [5]. 

'  %  ' 

2.  A  fast  time-synchronous  backward  pass  produces  an  initial  N-besl  list  using  the 

iLi''’  '■  ,- 
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Word-D.npendent  N-best  algorithm  [64], 

'S'  ^  •;'■■ 

r.Y-"- 

:>•  X  V 

1?-  '•».■■ 

l.  .  V 

mi- 

[ 

- - - - -I 

cum'IIh:  (n  i.Rvii  w  di  harc 


i() 


I  .n  il  I'l  llu'  N  st'nlL'ikC  In  n 

.Mill  vCMil  iDiiiinuoin  iloimn  IISINK  .iiul  il.. 


•Mill  ii.". .  '.MMil  huuiki.irs  inplioik's 
hc'.i  li^i  K'ordoiod 


4.  I'hc  N-bcst  list  is  iL'scorcil  wilh  a  iri^irani  }!i.i  iiniar  aiul  iL'('nlcii.’d  again. 


Eacli  uiltrance  is  quaiui/ed  and  decoded  three  liiiie--.  once  wiih  each  gender-dependent 
model  and  once  with  a  gendcr-independeiii  model.  I  or  each  niieiance,  the  N-besi  list  with 
(lie  higlter  top- 1  hypothesis  score  is  chosen.  Tlie  lop  choice  in  ihe  linal  list  ctinstituies  the 
speech  recognition  rosulls  reptnied  helms.  Then  the  entire  Im  is  passed  lo  the  language 
undeisl, lading  component  loi  I'unher  reoidaing  aii.‘  inierpreiaiion. 


1, 

2.3.3  Training  Conditions 

Although  iliere  were  no  reslriclions  on  the  training  data  lo  be  used,  we  used  speech  diila 
I'rom  the  AT1S2  subeorpus  exclusnely  to  train  the  paramelers  ol'  the  acousiic  model.  We 
did  this  because  we  I'elt  that  the  ATIS2  subset  best  lepresemed  the  conditions  of  ihc  test 
and  because  we  I'elt  that  simply  adding  more  training  daia  lo  achieve  incremental 
improvements  is  scieiUilically  uninteresting, 

However,  we  filtered  the  training  data  for  quality  in  several  ways.  We  removed  from  the 
training  any  utterances  that  were  marked  as  truncated,  containing  a  word  fragment,  or 
containing  rare  nonspccch  events.  Our  forward-backward  training  program  also 
aulomaiically  rejects  any  input  that  fails  to  align  properly,  thereby  discarding  many 
sentences  with  incorrect  transcriptions.  These  steps  removed  1,^89  utterances  from 
consideration. 

After  holding  out  the  1001  sentences  of  the  Feb.  ’92  test  as  a  development  test  set,  we 
were  left  with  a  total  of  10925  utterances  for  training  the  HMMs.  For  statistical  language 
mode,  training  wc  used  all  available  (17,313)  senlciice  texts  from  ATISO,  ATISl,  and 
ATIS2  (excluding  the  development  test  sentences  from  the  language  model  training  during 
the  development  phase). 

The  lexicon  used  for  recognition  w'as  initialized  by  including  all  words  observed  in  the 
complete  grammar  training  texts.  Common  closed-class  words  such  as  days  of  the  week, 
months,  numbers,  plane  types,  etc.,  were  completed  by  hand.  Similarly,  we  included 
derivations  (mo.stly  plurals  and  posscssives)  of  many  open-class  words  in  the  domain.  We 
also  added  about  400  concatenated  word  tokens  for  commonly  occurring  sequences  such  as 
WASHINGTON_D.C,  SAN -FRANCISCO,  and  D.C-TEN,  The  final  size  of  the  lexicon 
was  1830  words.  For  the  November  '92  evaluation  test  set  only  57  word  tokens,  covering 
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52  iinaiiic  words,  wore  oui-or-\ocahiil.ii\  (OOVi  '  this  Il-mooii.  This  is  only  ;i  O.O't 
OOV  word  olliiiiciki;  laio  o\or  ilic  whole  icsl  s'.^ 

We  esiiiimied  the  piiraincters  of  our  siatisiical  bii:r,iin  and  trigram  grammars  using  a  new 
hacking-off  procedure  155|  that  is  somewhat  simpler  than  that  of  Katz  1.58).  The  n-grams 
'were  computed  on  word  classes  in  order  to  share  the  \ery  sparse  training.  A  total  of  1090 
■semantic  classes  w'cre  delined  (most  words  remained  smglettms  in  their  class). 


2.3.4  Spcocli  Kcco^iiitioii  Results 

lahic  2.1  shosss  the  ollieial  results  for  BN’lJl.OS  oi;  tins  cs.ihialion.  broken  dovtn  Iw 
utterance  ehiss.  We  also  shove  the  average  iK'ipU  'iiv  of  the  higram  and  trigram  iangtiiige 
models  ;ts  measuieti  vni  the  evaluiition  test  sets  ticn-'ring  oui-uf-voeabularv  words), 


Senicnee 

Digram  ' 

Ingram  1 

eb92-Nuv92-Nov9.3 

('lass  j 
A+D  ; 

I’erplcx 

Perplex 

'!•  Word  lirrors 

17  ! 

;  12  ■ 

6.2-4. .3-.3. 3 

A+D+X 

20  1 

15  ' 

9..4-7.6-4.4 

A  1 

’  1  ^  , 

it)  ■ 

5,8'^Ti-Xoy' 

D  1 

20  1 

14  ; 

7.0-4.8-4,0) 

...,1 

28  1 

L_  .  X. 

17.2-14,5-8.6) 

Table  2.1;  Oflicial  SPRI£C  results  oti  l-eh‘'2.  Nov92,  ai  d  Nov93  test  sets. 


The  word  error  rate  in  each  category  was  lower  than  any  other  speech  system  reportitig  on 
this  data. 

For  a  dtscussion  'c  the.se  results,  the  reader  is  referred  to  .seclioti  5.1.4  of  160]. 


2.4  Delphi  ~  Natural  Language  Understanding  ! 

i 


The  natural  language  (NL)  component  of  HARC  is  the  Delphi  system.  Delphi  uses  a 
definite  clause  grammar  formalism,  augmented  by  the  use  of  constraint  nodes  [69]  and  a 
labelled  argument  formalism  [21].  Our  initial  parser  used  a  standard  context-free  parsing 
algorithm,  extended  to  handle  a  utiification-ba.sed  grammar.  It  was  then  modified  to 
integrate  semantic  processing  with  parsing,  so  that  only  semantically  coherent  structures 
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\\.Hikl  Ih'  I'LkoJ  111  ihe  ssiu.Kiic  1.I1.111  TIk-  \|ii.“til  .iiul  rolnisini’Ns  were  enhanced  by 
ikliiiiy  10  .III  .i^ienda-based  eluin-paiser.  uuli  scheduling:  depeiuiiiii:  on  ihe  measured 
stalisiie.d  likeliluuid  (il  gi'aniniaiieal  lutes  |2()|  This  greatly  rediieed  the  search  space  lor 
Ihe  hesi  parse. 


.The  most  recent  version  of  Delphi  includes  chan-'cs  to  the  syniactie  and  seniaiuie 
•  components  that  maintain  the  tight  syniactic/semantie  coupling  characteristic  of  earlier 
versions,  while  allowing  the  system  to  provide  semantic  interpr>''ations  of  input  which  has 
no  valid  global  suiiactic  analysis  This  included  the  develojsnieni  of  a  "fallback 
component"  |2.T  7()1,  m  which  siaiisiical  esiimales  play  an  important  role.  I'his  componeni 
.ilhiws  Delphi  III  de.il  elleeiisels  with  Imguisnealls  ill-formed  inputs  that  .iiv  common  in 
spiMUiiieoiis  speech.  ,is  well  ,is  with  the  word  errois  produced  In  the  speech  leeogni/er 


2.4.1  S*ar,siiii;  a.s  Tran.sducUon  -  (Jram.iiafical  Relations 


The  Delphi  parser  is  not  a  device  loi  coiisiructing  svntactic  trees,  but  an  inlormuiion 
transducer.  .Semantic  interpretation  is  a  process  operating  on  a  set  of  messages 
charactcri/ing  ItK'al  "grammatieal  relations"  among  phrases,  rather  than  as  a  recursive  tree 
walk  over  a  globally  complete  and  ctihcrent  par.se  tree.  The  grammar  has  been  reoriented 
around  local  grammatical  relations  such  as  deep-structure  subject  and  objecl,  us  well  us 
other  udjunet-like  relations.  Tlie  goal  of  the  panscr  is  to  make  these  local  grammatical 
relations  (which  arc  priinurily  encoded  in  o.dcring  and  constituency  of  phrases)  as  readily 
available  to  the  semantic  interpreter  as  information  explicitly  encoded  in  the  words 
themselves. 


From  the  fwint  of  view  of  a  synlaciic-.semaiuie  transducer,  the  key  point  of  any 
grammatical  rcladon  is  that  it  licenses  a  small  number  of  semantic  relations  between  the 
"meanings"  of  the  related  con.stiiuents.  Sometimes  the  grammatical  relation  consirains  the 
semantic  relation  in  ways  that  cannot  be  predicted  from  the  semantics  of  the  constituents 
alone  (e.g.  Given  “John",  "Mary",  and  "kissed",  only  the  grammatical  relations  or  prior 
world  knowledge  determine  who  gave  and  who  received).  Other  times  the  grammatical 
relation  simply  licenses  the  only  plausible  semantic  relation  (e.g.,  “John",  “hamburger", 
and  “ate”).  Finally,  in  sentences  like  "John  ate  the  fries  but  rejected  the  hamburger", 
semantics  would  allow  the  hamburger  to  be  eaten,  but  our  knowledge  of  its  destiny  is 
mediated  by  its  lack  of  any  grammatical  relation  to  “ate". 


Grammatical  relations  are  expressed  in  the  grammar  by  giving  each  element  of  the  right 
hand  side  of  a  grammar  rule  a  grammatical  relation  as  a  label.  A  typical  rule,  in  schematic 
form,  is: 


. ...... 'lii..  *,, ...  -c-'i  v-  '»«’ 7--  .* -a 
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(NP  )  •  iHEAD  (NP  .  )  :PP-COMP  (PP  : PREP  ) 


uliicli  says  ihui  a  noun  phrase  followed  by  a  prepi)sitional  phrase  provides  evidence  for  ihe 
relation  I’P-COMP  iieiween  the  PP  and  IIHAD  of  the  NP. 

One  of  the  right-hand  elements  must  be  lalxded  the  "head"  of  the  rule,  and  is  the  initial 
source  of  information  about  the  semantic  and  syntactic  “binding  state”  which  controls 
whether  other  elements  of  the  nglit  haiul  side  can  '■•’md"  to  ihe  head  via  their  lalseled 
relation 

This  view  made  it  possible  to  both  ilccieasc  the  mmilier  ot  grammar  rules  (from  1 1  to 
anil  mcrease  sMitactic  coverage.  Most  aitachmenis  can  Iv  modelled  by  simple  binary 
adjunction,  and  since  the  detail'  of  the  syiii.ictic  tree  structure  arc  not  ceniial  to  a 
transducer,  each  aditmct  c;m  be  seen  .i>  heir.g  "logicMlIy  atuiched"  to  the  ''head"  of  the 
constituent. 

In  effect,  we  factored  single  grammar  rules  'that  produced  ordered  sequences  of 
constituents)  into  stnaller  binary  adjunction  rules  that  can  be  combined  together  in  various 
ways;  ordering  constraints  on  these  adjuncts  are  provided  by  lexical  semantic 
well-formedness  rules.  For  example,  rather  than  using  sutKaiegori/.aiion  features  to  name 
sets  of  categories  that  appear  together  as  complements  of  a  verb,  we  have  delined 
approximately  1.5  verb  phrase  rules  that  list  the  possible  constituents  that  may  apixtar  us 
complements  to  a  verb.  These  may  enilicd  within  each  other  freely,  so  long  us  the  results 
are  semantically  intcrpretahle  by  the  head.  Similai  recursive  binary  complement  and 
adjunct  rules  are  provided  for  noun  phra.ses.  This  recursive  scheme  allows  the  adjunction 
rules  of  the  grammar  to  be  combined  together  in  novel  ways,  governed  by  the  lexical 
semantics  of  individual  words.  The  grammar  writer  dws  not  need  to  foresee  all  possible 
combinations. 


2.4.2  “Binding  rules”  -  the  Semantics  of  Grammatical  Relations 

The  interface  between  parsing  and  semantics  is  a  dynamic  process  structured  as  two 
coroutines  in  a  cascade.  The  input  to  the  semantic  interpreter  is  a  sequence  of  messages, 
each  requesting  the  semantic  "binding”  of  s'^mc  constituent  to  a  head.  A  set  of  “binding 
rules”  for  each  grammatical  relation  licenses  the  binding  of  a  constituent  to  a  head  via  that 
relation  by  sixicifying  the  semantic  implications  of  binding.  These  rules  specify  features  of 
the  semantic  structure  of  the  head  and  bound  constituent  that  must  be  true  for  binding  to 
take  place,  and  may  also  specify  syntactic  requirements.  Rules  may  also  allow  certain 
semantic  roles  (such  as  time  specification)  to  have  multiple  fillers,  while  other  roles  may 
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.llli)\\  jll''!  one  lilk'l. 

A^  adjuiKis  arc  added  to  a  stiuciurc,  the  hiitdiiig  lisi  is  coiiiliiionally  extended  as  long  as 
semantic  eolierencc  is  maintained.  When  a  eonstitiieiii  is  syntaciieally  complete  (i.e.,  no 
tnofc  adjuncts  are  to  lx'  added],  Deljihi  evaluates  rules  tliat  check  for  semantic 
Completeness  and  produce  an  "inierprelation"  of 'he  erri'stituent. 


2.4.3  Kobu.stne.ss  Ha.scd  on  Statistics  and  Semantics 

rid'oitunalcly,  simply  h.iving  a  iiansduction  system  aith  semtutties  based  on  grammatical 
relations  docs  not  deal  directly  \sith  the  key  rssue  <>l  rohustness  -  the  ability  to  make  sense 
of  an  input  esen  if  it  cannot  be  assigned  a  well-foimed  gltrhal  syntactic  atralysis.  The 
dil'Iiculiy  with  standard  syntactic  techniques  is  that  local  syntactic  evidence  is  not  enough 
to  accurately  dcterniine  gramtn.iiieal  relations.  A  .Sl>  followed  by  a  verb  may  be  the 
subject  of  that  veil)  ("JtlllH  Itvyw  to  IJoston")  or  may  he  unrelated  (“The  man  I  itiiroduccd  to 
John  lle.w  to  Boston"),  The  standard  solution  is  to  Imd  a  global  parse,  which  provides  the 
necessary  conlirming  evidence  lor  the  local  relations  it  contains. 

In  Delphi  we  view  standard  global  parsing  as  merely  oi.s  way  to  obtain  evidence  for  the 
existence  of  the  grammatical  relatmns  in  an  input  string.  Dclphi’.s  strategy  is  based  on  two 
other  sources  of  information,  Delphi  applies  semantic  constraints  incrementally  during  the 
parsing  process,  so  that  only  semantically  coherent  grammatical  relations  arc  considered. 
Additionally,  Delphi  has  statistical  information  on  the  likelihood  of  various  word  sen.ses, 
grammatical  rules,  and  grammatical-semantic  transductions.  Thus  Delphi  can  rule  out 
many  locally  possible  grammatical  relations  on  the  basis  of  semantie  incohcrcnec,  and  can 
rank  alternative  local  structures  on  the  basis  of  empirically  measured  probabilities.  The  net 
result  is  that  even  in  the  absence  of  a  global  parse,  Delphi  can  quickly  and  reliably  produce 
the  most  probable  local  grammatical  relations  and  semantic  content  of  various  fragtnenis. 

Delphi  first  attempts  to  obtain  a  complete  syntactic  analysis  of  its  input,  using  its 
agenda-based  best-first  parsing  algorithm.  If  it  is  unable  to  do  this,  it  uses  the  parser  in  a 
fragment-production  mode,  which  produces  the  mo,st  probable  structure  for  an  initial 
segment  of  the  input,  then  restarts  the  parser  in  a  top  down  mode  on  the  first  element  of 
the  unparsed  string  whose  lexical  category  provides  a  reasonable  tv  -hor  for  top-down 
prediction.  This  process  is  repeated  until  the  entire  input  is  spanned  witli  fragments. 
Experiments  have  shown  that  the  combination  of  .stat'.slical  evaluation  and  semantic 
constraints  produces  cliunks  of  the  input  that  arc  very  useful  for  interpretation  by 
non-syntactically-driven  strategics. 


riiM’ir.R  :  overview  oe  harc 


2.4.4  \cl\  iiii(»ncs  ol' Delphi's  Appioiicli 

The  scparaiion  of  syniaciic  grammar  rules  from  sem.miic  bimliiig  and  eompiciion  rules 
greatly  laeilitales  fragmem  parsing.  While  it  allows  syntax  and  semanties  to  he  strongly 
coupled  in  terms  of  processing  (par.dng  and  semantic  inicipretaiion)  it  allows  them  to  be 
.es.sentially  decoupled  in  terms  of  notation.  This  makes  the  grammar  and  the  semantics 
considerably  easier  to  modify  and  maintain. 

We  believe,  howescr,  that  in  the  long  term  the  mosi  im]K)riam  adstmtage  is  that  separating 
syniaciic  rules  from  scmaniic  hmding  Ic.id .  as  to  .1  new  kind  iif  l.ingiiagc  model,  based  on 
gramm.iiic.il  rcl.iiions,  m  u.luch  ssmaciic.  scm.mtic  .md  lexical  knowledge  can  he  aci.|tiircd 
by  largely  automatic  means.  We  xicw  the  role  ol  ihe  gi.immar  as  coililying  the  w;iy  that 
tree  structure  provides  evidence  for  giammaiic.il  lel.iiioiis  The  sepaiiiiion  between  rule 
tyi>cs  will  allow  us  for  the  lirsi  time  to  consider  the  effect  of  giaimmitical  relations  on 
meaning,  mde|\'iulenily  of  the  way  that  evidenee  lor  these  relations  is  produced  by  the 
parser. 

One  elfect  of  this  approach  is  to  make  it  |)ossible  to  u.se  a  hy|iothesi/.ed  semantic 
interpretation  of  a  set  of  tree  fragments  to  generate  a  new  syntactic  rule.  Thus,  in  normal 
o|K'ralion.  the  primary  evidence  lor  a  grammatical  relation  is  the  result  of  actually  parsing 
|)art  of  an  in|iui.  However,  since  grammaiieal  relations  isctweeii  constituents  entail 
.semantic  relations,  if  we  can  make  an  estimate  of  the  likelihood  of  certain  semantic 
relations  based  on  domain  knowledge,  pragmatics,  and  task  models,  etc.,  it  is  in  principle 
possible  to  use  abdueiive  reasoning  to  suggest  likely  grammatical  relations,  and  thereby 
automatically  propose  new  grammar  rules. 


2.5  Combined  Spoken  Language  System 


Figure  2.1  shows  the  components  of  the  entire  spoken  language  system. 

The  basic  interface  between  BYBLOS  and  Delphi  in  HARC  is  the  N-bcsi  list.  In  the  most 
basic  strategy,  we  allowed  the  NL  component  to  search  arbitrarily  far  down  the  N-best  list 
until  it  either  found  a  hypothesis  that  produced  a  database  retrieval  or  reached  the  end  of 
the  N-best  list.  However,  we  have  noticed  in  the  pa.st  that,  while  it  was  beneficial  for  NL 
to  look  beyond  the  fir.st  hypothesis  in  an  N-besl  list,  the  answers  obtained  by  NL  from 
speech  output  tended  to  degrade  the  further  down  in  the  N-best  list  they  were  obtained. 

For  the  1992  evaluation,  we  optimized  both  the  depth  of  the  search  that  NL  performed  on 
the  N-best  output  of  speech  and  how  wc  used  a  number  of  fall-back  strategies  for  NL  text 
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Figure  2.1;  BBN’s  SLS  AXIS  System 
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piocc''siim  123|.  roiitul  ihiU.  uiNcn  the  current  ■  lorinunee  iil  iill  (he  eoiuponeiits,  the 
iiptiiu.il  luiiiiber  of  Inpotheses  to  eons'  ler  seas  N-  I-uriliennore,  we  lotind  lliai  ratlicr 
than  applying  the  lall-back  incchanisin  to  each  of  ilicsc  hypotheses  in  turn,  it  was  better  to 
make  one  pass  thiougli  the  N-West  hypotheses  using  ihc  full  parsing  strategy,  anti  then,  il 
no  sentences  were  accepted,  make  another  pass  using  the  lall-back  strategy. 


2.6  Interface 


The  iMterlaee  to  the  llBN  IIARC'/ATI.S  system  consists  of  a  lew  huge  control  buttons,  a 
seqiienee  of  status  lights,  and  a  series  of  display  windows. 

The  control  buttons  allcAV  the  user  (via  the  muu.rc.  ;■>  command  the  system  to  begin  to 
listen  for  a  spoken  question  or  comtnand.  to  cancel  a  query  in  progress,  to  provitie  help,  to 
clear  the  conte.xi,  ttnd  to  display  another  window  in  which  the  user  may  set  various  system 
parameters  or  quit. 

The  status  lights  indicate  whether  the  system  is  Ready  (not  listening,  but  ready  for  the  user 
to  click  the  Listen  button  and  talk).  Listening  (microphone  on).  Recognizing  (BYBLOS 
speech  recognition  in  progress)  Understanding  (the  Delphi  language  understanding  system 
in  progress,  followed  by  the  translation  of  the  understood  utterance  to  daiabasc  commands), 
and  Retrieving  (database  retrieval). 

The  display  windows  include  the  results  of  the  speech  recognition  process,  Delphi  s 
paraphrase  of  the  meaning  of  the  utterance,  the  data  (usually  a  table)  retrieved  from  the 
databa.se,  ’and  the  context  that  is  maintained  from  one  query  to  the  next. 

Figure.s  2.2  through  2.8  show  the  BBN  ATIS  .screen  after  various  kinds  of  queries  or  other 
user  actions. 


2.7  Results 

In  Table  2.2  wc  show  the  official  performance  on  the  November  '92  evaluation  data  as 
calculated  by  NIST.  The  percent  correct  and  the  weighted  cnor  rate  is  given  for  the  Delphi 
system  operating  on  the  transcribed  text  (NL)  and  for  the  combined  HARC  system  (SLS). 
The  results  are  shown  for  clas.ses  A-t-D  combined  and  separately.  The  weighted  error 
measure  weights  incorrect  answers  twice  as  much  as  no  answer. 

weight  .1  error  (WE)  =  2  x  %incorr.ct  answers  +  7ono  answers 
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Figure  2.2:  A  Spoken  Imcraeiion  with  lire  BBN  HARC7AT1S  System 


Figure  2.3:  BBN/ATIS  An.swers  a  Question  about  Ground  Transportation 
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#10.  LATE  THURSDAY  AFTERNOON  WASHINGTON  BOSTON  CHEAPEST 


Her'O  are  the  fihhts  from  Washington  to  Boston  on  Thursday  that  leave  between 
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Figure  2.4:  Uilcraiiecs  Need  Not  be  Coniplelc  or  Graiiimutical 
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Figure  2.5:  Using  Prior  Context,  3BN/ATIS  An.swcrs  a  Complex  Query 
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Figure  2.6;  BBN/ATIS  Provides  a  Full  Vocabulary  List 
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Figure  2.7:  BBN/ATIS  Help  for  the  User 
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Figure  2.8;  The  “Other..."  WiiuUnv  Lets  I'ser  Change  System  l\n*ainclcrs 
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Tabic  2.2;  VrConccl  and  V/cigIned  error  on  the  November  '92  test  set. 


The  weighted  error  on  conlcxi-depcndent  sentences  (D)  is  itbocl  twice  that  on  sentences 
that  stand  alone  (A).  This  higher  error  rate  results  iVom  two  phenomena.  First,  it  is  often 
difficult  to  resolve  references  correctly  and  to  know  how  much  of  the  previous  constraints 
are  to  be  kept.  Second,  in  order  to  understand  a  context-dependent  sentence  correctly,  wc 
must  correctly  understand  at  least  two  sentences,  which  is  less  likely. 

The  weighted  error  from  speech  input  is  from  8%-10%  higher  than  from  text.  However, 

■  the  difference  is  lower  than  might  be  expected.  Even  though  the  BYBLOS  system 
rnisrecognized  at  least  one  word  in  25%  of  the  uaeiances,  the  Delphi  system  was  able  to 
recover  from  most  of  these  errors  through  me  use  of  the  N-best  list  and  fallback  processing. 

The  official  SLS  result  for  HARC  was  a  weighted  eiror  of  30.6%.  This  represents  a 
substantial  improvement  in  performance  over  the  weighted  error  during  the  previous 
(February  ’92)  evaluation,  which  was  43.7%.  It  was  the  third  best  overall  result  for  a 
.  spoken  language  system  of  the  seven  sites  participating.  Based  on  end-to-end  tests  with 
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rciil  users,  ilie  sysicni  is  usable,  given  that  subjeci-  .ere  able  lo  aecomplisli  vheir  assigned 
tasks,  flosvever,  wo  believe  that  the  largest  remaiiiii..:  improvenienl  will  not  be  from  speech 
modeling  or  basic  natural  language  understanding,  ''‘it  from  more  careful  task  modeling. 

After  the  1992  evaluation,  the  official  ARPA  scoring  metric  was  changed  from  weighted 
error  to  simple  unweighted  error  (percent  incorrect  +  percent  unanswered). 

The  results  for  the  December  1993  ARPA  evaluation  arc  given  in  Table  2  3. 


Corpus  j  NL  Cor  j 

1  NL  En- 

■SI.S  Cor  i 

SLS  Eirl 

!  A+D  1 

85.3 

1  14.7 

7.5^4  1 

17.5  j 

A 

90.4 

9.6 

H6.2  1 

13.8  1 

'  D 

78.1 

21.8 

77.5  , 

>2.5  j 

Table  2.3:  Correct  and  Simple  Error  on  the  December  '93  test  set. 


2.8  Real-Time  Implementation 


A  real-time  demonstration  of  the  entire  spoken  language  system  described  above  has  been 
irnplem ’nted.  The  speech  recognition  was  performed  using  BBN  HARK^^,  a 
commciciaily  available  product  for  continuous  speech  recognition  of  medium-siicd 
vocabularies  (about  1  .CKK)  words).  HARK  stands  for  High  Accuracy  Recognition  Kit. 
HARK^^  (not  to  be  confused  with  HARC)  has  essentially  the  ;ame  recognition  accuracy 
as  BYBLOS  but  can  run  in  real-time  entirely  in  software  on  a  workstation  with  a  built-in 
A/D  converter  (e.g.,  SGI  Indigo,  SUN  Sparc,  or  HP715)  without  any  additional  hardware. 

The  speech  recognition  displays  an  initial  answer  as  soon  as  the  user  stops  speaking,  and  a 
refined  (rescored)  answer  within  1-2  seconds.  The  natural  language  system  chooses  one  of 
the  N-best  answers,  interprets  it,  and  computes  and  displays  the  answers,  along  with  a 
paraphrase  of  the  query  so  the  user  can  verify  what  question  the  system  answered.  The 
total  response  cycle  is  typically  3^  seconds,  making  the  system  feel  extremely  responsive. 
The  error  rates  for  knowledgeable  interactive  users  appears  to  be  ;..jch  lower  than  those 
reported  above  for  naive  noninteractive  u.sers. 
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2.9  Summary 


'll  '.his  chaptci',  we  have  deseiibcd  the  HARC  spoken  language  inidcrsianding  sysleiii. 
HARC  consists  of  a  niodulai  inlegration  of  the  BYBLOS  speech  rccogniiioii  system  vvitli 
the  Delphi  natural  language  understanding  system.  The  two  components  are  integrated 
using  the  N-best  paradigm,  We  have  shown  that  this  paradigm  is  a  modular,  efficient,  and 
effective  method  for  integrating  speech  ccocnition  and  language  understanding 
components.  In  addition,  the  N-best  strategy  was  shown  to  he  useful  within  the  speech 
recogtiition  system  as  a  means  of  applying  espensise  knowledge  stiuiccs,  such  us 
cross-word  .icoustic  models  and  trigram  language  models.  The  entire  system  has  been 
i.nplemcntod  tti  run  in  real  time  on  a  standard  workstation  without  the  neetl  for  any 
additional  hardware. 


Chapter  3 

The  Delphi  Natural  Language 
Understanding  System 


3.1  Introduction 


This  chapter  presents  Delphi,  the  itatmul  language  component  of  tnc  BUN  Spoken 
Language  System.  Delphi  is  a  cloniain-indejjciijcni  imteral  language  question  answering 
system  that  is  solidly  based  on  linguistic  principles,  yet  which  is  also  robust  to 
ungrammatical  input.  It  includes  a  domain-independent,  broad-coverage  grammar  of 
Engli.sh.  Analysis  components  include  an  agenda-based  bcsi-hrst  parser  and  a  fallback 
component  for  partial  understanding  that  works  by  fragment  combination.  Delphi  has  been 
formally  "evaluated  in  the  ARPA  Spoken  Language  program’s  ATIS  (Airline  Travel 
Information  System)  domain,  and  has  performed  well.  Delphi  has  also  been  ported  to  a 
spuken  language  dcmonstr.ntion  system  in  an  Air  Force  Resource  Management  domain.  Wc 
discuss  results  of  the  evaluation  as  well  as  the  porting  process. 

Delphi  is  a  natural  language  understanding  system  based  on  general  linguistic  principles 
which  is  adaptable  to  any  question-answering  domain.  It  incorporates  a  number  of 
domain-independent  knowledge  bases,  including  a  general,  broad-coverage  grammar  of 
English  with  a  powerful  and  flexible  handling  of  complementation.  'Jnlike  most  other 
linguistically  motivated  systems,  liowevcr,  Delphi  is  also  highly  robust,  allowing  for  partial 
understanding  when  an  input  is  ungrammatical,  disfluent,  or  not  properly  iranscribed  by  a 
speech  recognizer.  Thus,  Delphi  can  be  used  for  a  spoken  language  application  as  readily 
as  for  a  written  one.  Furthermore,  Delphi’s  partial  understanding  component,  called  the 
■  Semantic  Linker,  is  driven  off  the  same  system  of  semantic  rules  as  Delphi’s  regular 
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b'isi-lirsi  parser.  Building  a  robusi  applicaiion  thcielorc  reciuires  no  atklilionul  effort. 

There  arc  several  components  of  the  system,  which  is  diagramed  in  I'igure  .^.1. 

First  arc  the  parser  and  Semantic  Linker,  which  output  an  intermediate  rcprc,scntation  we 
call  a  “semar.tic  graph".  The  semantic  graph  is  passed  to  a  quantification  .stage  which 
produces  a  fully  scoped  logical  form  from  it.  The  Irjgical  form  is  then  passed  to  the 
discourse  stage,  which  resolves  pronominal  references  and  performs  other  types  of 
task-dependent  constraint  resolution  to  produce  the  linal  logical  form.  The  final  logical 
form  is  then  passed  to  the  backend  translator,  and  then  to  the  applicaiion  system  which 
produces-ihc  response.  Several  knowledge  bases  are  employed  by  these  analysis 
components,  including  grammar,  "realization  rules"  and  the  domain  model,  wliich 
represents  the  set  of  classes  and  binary  relations  of  the  given  applicaiion  domain. 

Delphi  differs  from  most  other  linguistically  motivaied  sy.stems  in  the  role  that,  is  played  by 
syntax.  The  primary  function  of  Delphi’s  parser  and  syntactic  knowledge  bases  is  not  to 
produce  a  parse  tree,  but  rather  to  constrain  the  search  for  an  appropriate  semantic  graph 
interpretation  of  the  utterance,  Semantic  graphs  arc  produced  not  by  rulc-lo-rulc 
compositionalily,  but  by  wliat  might  be  called  '‘relation-to-rclation"  compositionaliiy  -  the 
association  of  grammatical  relations  in  the  syntactic  structure  with  semantic  relations  in  the 
semantic  graph. 

This  more  incremental  view  of  the  syiuax/scmuntics  interface  has  three  crucial  advantages. 
First,  there  is  much  more  flexibility  with  respect  to  ordering  and  optionality  of  constituents. 
Second,  'oecause  rclalion-to -relation  .ranslaiions  are  simple,  the  task  of  porting  the  system 


CHAPTER  J.  THE  DELPHI  NATURAL  lANGU  •7-;  LINDERS  I  ANDINCi  SYSTEM 


3: 


IS  grcaily  slmpliliciJ.  Iliird  aiui  linalls.  pailial  or  incmaiA  analyses  can  he  represeiued, 
and  a  eoniplete  seniaiuic  graph  imcrpreiaiion  lor  i..  iiiieranee  prodiieed  even  when  a 
eonipleie  synlaciic  analyses  is  not  available. 

In  the  remainder  of  this  cliapter,  we  describe  Dc'pin's  main  proeessing  eoniponcnis, 
representational  formalisms,  and  knowledge  base  .. 


3.2  Grammar  And  The  Syntax/Semanties  Interface 

Tlte  Delpfii  grammar  is  a  broad  eoverage,  domain  mdcpendeni  grammar  of  Linglish  written 
in  a  version  of  the  Uefmite  Clause  Grammar  formalism  15.3)  that  has  been  exieiuled  to 
inehkie  labeling  of  right-hand  side  elen..  is  with  the  grammatical  relations  they  hear  to  the 
head  of  the  construction.  An  example  is: 

(S  7ara  Tmood) 

-  > 

aubjacti  (NP  7ar0  ?moocl  ate.) 
haadi  (VP  ?agr  Tmood  etc.) 


In  this  rule,  there  is  a  head  VP  and  an  NP  which  bears  the  SUBJECT  relation  to  it.  Other 
grammatical  relations  include  the  familiar  DIRECT-OBJECT  and  INDIRECT-OBJECT  as 
well  as  the  prepositions,  such  as  TO,  FROM,  WITH  and  so  on. 

Annotating  sub-constituents  with  graininatical  relations  regulari/,cs  the  syntactic  structure 
with  respect  to  particular  grammatical  rules,  and  allows  a  “relation-to-rclation’'  form  of 
compositionality,  as  oppo.scd  to  the  more  traditional  ‘'ruic-lo-rulc  ’  version  that  is 
cxen.(dified  by  such  systems  as  Gemini  [28|  and  the  Core  Language  Engine  [2],  in 
relation-to-rclation  compositionality,  each  grammatical  relation  in  the  syntactic  structure 
corresponds  to  a  semantic  relation  in  a  parallel  semantic  structure  we  call  a  "semantic 
graph”.  The  terminal  nodes  of  the  semantic  graph  are  the  word  meanings,  corresponding  to 
the  lexical  heads  of  syntactic  structure. 

'  An  example  of  a  semantic  graph,  representing  the  meaning  of  "What  flights  fly  from 
Boston  to  Denver”,  may  be  .seen  in  Figure  3.2.  The  semantic  graph  is  not  a  fully  quantilied 
formula;  rather  it  may  be  thought  of  as  a  form  of  predicate-argument  representation,  with 
quantifiers  in  place,  from  which  a  fully  quantified  formula  can  be  generated.  The  allowed 
class  and  relation  labels  come  from  the  domain  model. 
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Figure  3.2;  Scmaiuic  Graph 


This  \  iew  of  the  syiUux/seiuamics  interface  has  niarketl  advantages.  lT)i'  unc  tiling,  heeause 
the  syntaetie/seniantic  structure  is  huilt  up  one  argument  at  a  time,  it  becomes  much  easier 
to  .iccomodate  such  phenomena  as  order-vaiiation  aial  optional ity  of  arguments  that  are 
difficult  for  other  a|iproaches 

The  importance  of  this  feature  may  I"  seen  in  the  examples  of  argument  order-variation 
and  optionality  that  abound  in  real  data.  Consider  the  following  from  the  ATIS  domain,  iti 
which  completnetils  can  vary  freely  in  order: 


Whitt  JliglUs  Jly  from  lio.stoii  lit  Denver'.’ 
Wluti  flighi.v  Jly  10  Denver  from  Boston'.’ 


or  be  separated  from  the  head  by  a  tnodilier  typically  regarded  as  att  adjunct: 

Whnt  flights  fly  at  3  pm  from  Boston  to  Denver'/ 

In  some  cases,  modifiers  cati  be  omitted,  as  in; 

What  flights  fly  from  Boston  / 

What  flights  Jly  to  Denver? 

and  sometimes  the  omission  of  an  argument  can  have  anaphoric  coi'sequcnccs,  as  in: 
What  restrictions  apply? 

which  cannot  be  felicitously  uttered  except  in  a  context  where  there  is  something  in  the 
discourse  that  a  restriction  could  “apply”  to. 
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Coincniioiiul  appioachcs  to  subcaiegoil/.iiiion.  sucit  as  Dciiiiite  Clause  Grainnuii'  153], 
Caiegonul  Graniniar  1 1  ],  I’ATR-II  |66|,  an<l  Icxicali/.eil  TAG  1581  all  deal  with 
eonipleuieiuatioti  by  including  iit  one  forni  or  another  a  notion  of  “subcategorization 
fi'anic"  that  specifies  a  sequence  of  coinplcnient  phrases  and  constraints  on  them.  Handling 
.all  the  possible  variations  in  coniplenient  distribution  in  such  fornialisins  inevitably  leads  to 
an  explosion  in  the  number  of  such  frames,  and  a  correspondingly  more  diflieult  tat.k  in 
porting  to  a  new  domain. 

In  our  approach,  on  the  other  hand,  it  becomes  possible  to  view  subcategorization  of  a 
le.vical  item  as  a  set  of  constraints  on  the  outgoing  arcs  of  its  semantic  graph  luide. 
DilTerem  types  of  constraints  -  order  of  argtmicnts,  optiomdity  of  arguments, 
semantic-class  constraints  and  semantic  eflects  of  arguments  -  can  all  be  represented 
separately,  instead  of  enuitierating  all  possible  argument  scqtiences  in  a  set  of  alternative 
siibcategorization  frames. 

Subcategorization  coiistrainis  in  Delphi  are  encoded  in  lexical  eturies  using  a  striiett  re 
called  a  “map''  |711.  Below  is  jiart  of  the  lexical  entry  for  "lly''  in  the  A'fl.S  ilomain: 


FLY 

■ubj«ct!  FLIQKT-OF 
tot  DESY-OF 

£romt  ORXG-OF 

cotaplotioni  (and  (ilillad  Clight-of) 

(or  (filled  dest-of) 

(filled  orig-of)) 

Map  entries  have  “translation",  "realization"  and  "completion”  components.  The  translation 
pari  of  this  entry  5|>ecifies  that  the  lexical  head  "ny”  is  to  correspond  to  a  semantic-graph 
node  labeled  with  event-class  FLY.  The  realization  part  of  the  cnti7  specifics  what 
grammatical  relations  the  lexical  item  takes,  and  what  semantic  relations  these  correspond 
to,  or  "realize",  in  the  semantic  graph.  Here,  the  entry  specifics  that  “fly"  takes  SUBJECT, 
TO,  and  FROM  complements,  and  that  these  grammatical  relations  correspxmd  to  the 
semantic  relations  FLIGHT-OF,  DEST-OF.  and  ORIG-OF  respectively.  Semantic 
sclcclional  restrictions  in  these  argument  positions  -  that  the  filer  of  DEST-OF  be  a  city, 
for  example  -  are  implicit  from  the  declarations  of  the  relations  in  the  domain  model. 

The  "completion”  part  of  the  entry  s[)ecifics  what  outgoing  arcs  are  required  foi  the  node. 
Here,  the  entry  requires  that  the  FLlGHT-OF  role  be  filled,  and  that  either  the  DEST-OF  or 
ORIG-OF  roles  be  filed  (forbidding  the  intransitive  "the  flight  flics”).  More  complex 
optionality  cases  are  encoded  with  other  completion  predicates.  For  example,  the  case 
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wlnMC  >in  ;inaplioi'  inusi  be  proseiu  (“Whai  restricli"iis  apply"!  is  encoded  by  ihc  iJiedicale 
ni.Ll'.D-OR-ANAPHOR. 

Some  icidizatinn  rules  are  lied  to  semantic  classes  rather  than  lexical  translations,  and 
require  for  their  application  only  that  semantic  class  restrictions  implicit  from  the  domain 
and  range  of  the  realized  relation  be  saiislied.  Typical  examples  arc  the  rules  governing 
‘noun  modilier  meanings,  such  as  “Della  nights",  “Delta’s  flights",  “the  llights  on/aboard 
Delta".  These  would  all  be  handled  by  the  global  realization  rule: 

{NOM-COMP  PO.SS  ABOARD  ON  ...} 

— » 

AlRLlNli-OP 


Determining  what  semantic  relation  a  given  grammatical  relation  instance  corresponds  to  is 
most  generally  viewed  as  a  form  of  goal-solving  in  Delphi,  in  which  a  chain  of  rules  cun 
be  invoked.  For  example,  syniuetic  constructions  such  as  “X  with  Y",  "X  has  Y"  and  “X's 
Y"  are  interpreted  by  (irsl  appealing  to  a  rule  mapping  them  to  a  pseudo-relation  culled 
GENERALIZED-POSSESSION,  and  then  seeking  a  realization  for  it  that  is  compatible 
with  the  classes  of  X  and  Y.  This  avoids  having  to  write  throe  different  versions  of  the 
same  realization  rule. 

An  important  advantage  of  the  realization  rule  formulation,  apart  from  its  its  power  and 
flexibility,  is  its  simplicity.  Realization  rules  arc  very  simple  to  write,  and  make  maximal 
use  both  of  knowledge  about  the  domain  and  genera)  knowledge  of  language. 


3.3  111-Formedness  Handling:  The  Semantic  Linker 

When  an  utterance  cannot  be  parsed  with  Delphi’s  best-first  parser  [20]  -  cither  because  it 
is  ill-formed,  mis-recognized  by  the  speech  system,  or  simply  because  it  is  outside  the 
.  coverage  of  the  grammar  -  it  can  still  be  partially  understood  by  the  system,  often  well 
enough  to  give  the  correct  response.  The  component  responsible  for  partial  understanding 
in  the  Delphi  system  is  called  the  Semantic  Linker  (70). 

After  a  parse  fails  there  is  a  set  of  fragmentary  constituents  left  over  in  the  chart, 
corresponding  to  a  set  of  semantic  graphs.  The  Semantic  Linker  seeks  to  connect  these 
sub-graphs  into  a  single  connected  one  by  adding  links  between  nodes  in  the  different 
sub-graphs. 
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I'igure  3.3:  Fragmciii  Graphs 

At  lop-IcNcl,  this  IS  111  •  stmie  (lung  that  the  parsct  .mh  graittiitar  lIo.  The  diirL'iciiL'c  is  that 
the  parser  and  gr  .nn  tr  liavo  an  idea  ol'  vGiat  the  eiamiiiatical  relatiDiiship  helween 
eoiistittieiils  is,  based  on  requirements  ol'  tlieir  prosimitj  in  the  string  and  other  synttietie 
evidenee,  I'lie  Setnaittic  Linger  does  not  have  these  requirements,  being  i.  looser  I'ornt  of 
eotiihinutiun  that  eun  ignore  fragment  order  atid  skip  over  interveiting,  unanaly/able 
ntaierial  with  ease. 

Althougit  It  is  a  very  differenl  algorithm,  the  Semantte  Linke’'  uses  the  same  set  of 
realization  rules  that  drives  the  regular  parser.  Using  the  realization  rules,  the  Litiker 
detennlnes  for  caeli  pair  of  tiodes  iti  different  semantie  graphs  the  set  of  all  links  which 
can  ccntiect  tliem.  it  Ihett  uses  an  A*  search  to  (tnd  the  most  plausible  set  of  links  wltich 
produce  a  complete  graph. 

Suppose  for  example,  we  have  the  three  fragments  "to  Boston",  “Denver"  and  “Delta 
nights  on  Monday".  Then  the  three  corresponding  sub-gruph:i  are  those  .showti  in  Figure 
3.3  where  a  PF  is  treated  as  its  NP  object  witli  the  preposition  as  a  tag,  F'or  this  set  of 
fragmentary  sub-graphs,  the  possible  links  arc: 


la.  FLIOHTSl -  DE3T-OF  ->  BOSTONxTO 

lb.  FLIGHTSl -  ORIG-OF  ->  BOSTONsTO 

Za.  FLiGKTSl**"—  DEST-OF  — >  BEufVER 

2b.  FLIOHTSl -  ORIG-OF  ->  DENVER 

3a.  DENVER -  NEARBY- TO  ->  BOSTON:TO 


where  the  links  are  grouped  together  in  a  ordered  list  according  to  the  fragtnent-pairs  they 
connect. 

The  plausibility  of  a  given  Ititk  is  a  function  of  a  number  of  different  features,  iticluding 
penalties  from  assumptions  made  in  its  compulation  (e.g.  that  a  given  preposition  can  be 
ignored  or  assumed)  and  empirically  determined  probabilities  for  the  given  link  (e.g.  that 
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given  .111  All^LlNl:  and  u  I-'LIGH  T  they  are  mosi  'hablv  linked  by  ilio  reliiiiiin 
AIRL1M;-01'), 

The  Scinuniie  Linker  may  also  "luillucinaic"  a  iicu  node  to  bridge  two  I'ragniems  between 
whont  no  links  can  ollicrwise  be  computed.  Fore\.implc,  for  the  utlerance  "from  Boston  lo 
'Denver",  which  lias  no  explicit  FLlGHT-object,  a  I  LlGUr  node  can  be  inserted  between 
'the  fragmenis  to  make  .sense  of  the  utlerance. 

Because  ihe  .Semantic  Linker  uses  the  same  set  ol  u  alizaiion  rules  as  the  resl  of  ihe 
syslem,  when  the  syslein  is  ported  lo  a  new  domain  ihe  Semamic  Linker  can  be  used 
immcdiaiely  a  disiincl  advaniagc  over  sonic  oiliei  .ipproaehes  lo  fallback  undersiaiiding, 
such  as  1 23 1  or  (361, 

In  formal  experinienis  (as  see  discuss  subseijueiulv  die  .Seiiiaiiiic  I. inker  has  been  show  lo 
draiiiaiically  improse  Delphi's  perforniaiice. 


3.4  Quantification 


The  cjuantilier  scoping  module  in  Delphi  takes  u  semantic  graph  and  produces  a 
fully-scoped  expression  in  ihe  logical  language  I’MRL.  The  biisic  strategy  for  quantifier 
scoping  is  a  descendant  of  that  u.sed  in  the  LUNAR  sy.sleiii  177|.  This  is  made  possible  by 
the  use  of  die  .semantic  graph  as  a  coinmon  underlying  rcpre.sentaiion  for  both  the 
graniniatical  and  ill-formed  parts  of  fragmentary  uiierances.  Delphi’s  scoping  module  naps 
quantiliers  from  relative  clau.scs,  makes  the  quaiitiliers  from  FPs  etc.  outscope  the  NP 
quantifier,  and  resolves  the  .scojx;  of  quantifiers  from  par..ijcl  constituents  in  terms  of 
Icft-to-riglil  order  in  the  input,  These  general  rules  are  modified  to  lake  into  account 
differing  strengths  of  quantiliers  such  as  EACH. 

■  Lefi-to-right  ordering  and  syntactic  structure  for  grammatical  poitions  of  the  utlerance  are 
recovered  from  the  semantic  graph  by  backpointers  to  the  lexical  items  and  grammatical 
relations  from  which  the  graph  was  produced.  Links  established  by  the  Semantic  Linker 
arc  treated  by  the  quantification  mechanism  as  if  the  con.slituency  is  indeterminate,  so  that 
only  left-to-riglit  scoping  rules  and  individual  quantifier  preferences  lake  effect. 

The  resulting  mechanism  is  robust,  and  quantiiicatioiial  scoping  has  been  an  insigiiilicant 
source  of  enur  in  the  official  ARPA  blind-test  evaluations  of  the  AXIS  system.  More 
complex  strategies  have  been  proposed  and  implemented  in  the  last  two  decades,  and  could 
in  principle  be  modified  to  work  with  ill-formed  input,  but  the  simple  and  robust  LUNAR 
approach  handles  essentially  all  the  phenomena  .seen  in  the  tens  of  thousands  of  sentences 
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ol  ATIS  l-oIIcciciI  diinnp  expennicms  wii',  'in-liniiinsl  users. 


3.5  Discourse 


The  discourse  mecliauisni  of  Delplii  consi.sis  of  seser.il  coniponcnis;  rcsoluiion  of  local 
ambiguiiies.  pronominal  and  deictic  antecedent  icm  lutioii.  ellipsis  handling  and  discoLirse 
constraint  propagation. 

The  most  common  case  of  local  ambiguity  in  the  VI  LS  domain  involves  temporal  phi’iises 
as  in  ‘'tbe  nine  o'clock  (light”.  The  re.solution  ine-.lianism  searches  both  for  linguistic 
information  in  the  current  and  previous  sentences,  .is  well  as  properties  of  entities  in 
|trcvious  anssvers.  to  resolve  svhetber  "nine  o’clock"  is  AM  or  PM. 

The  pronoun/deiciic  resolution  mechanism  used  in  Delphi  makes  use  of  locally  expressed 
or  implied  semantic  constraints  to  search  through  a  set  of  candidate  antecedents.  The 
current  mechanism  ignores  .syntactic  nuinlter  as  a  cue.  because  empirically  In  the  ATIS 
corpus  (and  we  suspect  in  othci  spontaneous  speech  applications)  it  is  often  in  error.  A 
simple-minded  focus  component  is  u.scd,  primarily  bu.scd  on  recency,  and  .secondarily 
ba.scd  on  grammatical  relations  within  an  utterance.  Because  of  the  strength  of  semantic 
cues  and  the  prevalence  of  ill-formed  input,  the  use  of  syntactic  cues  for  focus  is  limited. 

The  interpretation  of  later  sentences  often  must  include  information  from  previous 
sentences,  without  explicit  linguistic  cues.  This  is  especially  true  in  "design  dialogues", 
where  thc.goal  is  to  lind  a  description  of  a  set  of  objects  tl  '  will  meet  some  .set  of  implicit 
or  explicit  constraints.  Consider  for  example  the  following  discourse  from  the  ATIS 
domain. 


Show  Della  flights  from  Boston  to  Dallas  tomorrow. 
Can  I  leave  in  the  morning? 

Is  there  a  nonstop  flight? 

Show  me  the  American  flights. 

I  want  to  go  from  Dalias  to  Chicago  on  Wcdne.sday 


Note  that  the  constraints  of  prior  sentences  (such  as  on  airline,  origin,  destination  etc.)  are 
implicit  for  subsequent  sentences  unless  contradicted  by  information  in  the  current  sentence 
(c.g.  "American”  overrides  the  "Delta"  from  the  first  sentence)  or  until  there  is  evidence 
that  a  new  problem  is  being  solved  (the  new  origin  and  destination  in  the  last  sentence 
indicates  that  all  previous  constraints  can  be  dropped).  Delphi  has  a  “context  tracker"  that 
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niaimains  a  suiL'k  of  the  coiisuaiius  from  previous  .  icranecs.  ami  Inis  a  scl  of  rules  for 
uhen  t'oiisiraims  arc  lo  he  modilicd  oi  dcleied  hoi"]..'  Iiciiig  iiicrp.eti  wiih  lire  curren! 
scfitcncc. 

Finally,  we  haiulle  ellipsis  as  a  special  ease  of  semaniie  linking.  If  we  have  ihe  two 
uiteranees: 


Show  iiif  III!'  nuuils  <»t  tlu'  >iu)ininf>  Ilii’hi. 
-Ill  Aiiicriciiii  til  12:31) 


We  can  treat  these  as  if  they  were  one  nin-on  ill-fonneil  input  aiul  link  '‘Ainericiin  to 
"llight'',  and  replace  "inorning"  with  ■‘I2:3()".  using  .1  minor  sariant  of  the  .Semantic  Linker 
linker  which  allows  for  later  eonstrtiints  to  overwrite  earlier  ones  of  the  same  type.  This 
strategy  has  heeii  ‘.eiy  elfeetive,  and  e-  'is  a  large  class  of  elliptical  constructions. 


3.6  Backend  I’vlapping 


In  order  to  get  a  response  to  a  user  quety,  the  complete  FMRL  interpretation  of  an 
utterance  must  he  translated  to  an  expression  of  a  target  query  language  which  can  be 
evaluated  directly  against  the  tabular  dulubasc  to  retrieve  the  answer. 

A  key  step  is  bridging  the  gap  in  eoneeptual  vocabulary  bctwee!i  the  two  representations. 
For  example,  the  FMRL  interpretation  of  the  query  "Mow  many  flights  on  Delta  serve 
meals"  has  one-place  predicates  like  FLIGHT  and  AIRLINE,  and  two-place  predicates  like 
AIRLINE-OF  and  MEAL-OF.  The  database  for  the  AXIS  domain,  on  the  other  hand,  only 
has  a  single  table  FLIGHT  with  fields  containing  airline  and  meal  information.  Delphi 
bridges  this  gap  between  representations  with  a  system  of  local  mapping  rules  which 
translate  the  one-  and  two-phice  predicates  of  the  FMRL  into  expressions  of  a  relational 
algebra  target  language  which  retrieve  the  extensions  of  these  predicates. 

Sometimes,  however,  some  combination  of  FMRL  predicates  has  a  correspondence  in  the 
database  but  the  individual  predicates  them.selves  do  not.  For  example,  in  the  database  for 
the  SPLINT  domain  a  table  relating  aireraft-types  to  their  physical  characteristics  has  a 
field  for  the  number  of  engines  the  aircraft  has,  but  no  representation  for  the  engines 
themselves,  if  we  now  ask  “How  many  engines  does  an  F-16  have'?”,  there  is  no  local 
iranslalicm  of  the  FMRL  predicate  ENGINE. 


■  To  deal  '  ith  this,  Delphi  has  a  system  of  global  transformations  that  are  applied  first. 
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rewriOiig  subsets  of  the  FMRl,  clauses  to  a  fonti  ;  .it  can  be  li.mtiled  wiili  local  translation 
The  rule  that  handles  this  example  is: 

(is-a  :e  engine  number) 

■  (aircraft-engine-ol'  :a  :e) 

(is-a  *count*  number) 

(cq  (numbcr-engincs-ot'  :a)  *coiini*) 


3.7  IiUerface  To  A  Speech  Reco}>nizer 

In  spoken  language  applications,  Delphi  is  inieiTaced  to  the  output  of  the  Byblos  speech 
recognition  system  19),  The  N-besl  paradigm  is  used,  in  which  the  recognizer  outputs  in 
order  its  top  N  guesses  at  the  transcription  of  the  sentence,  for  some  value  of  N  (usually 
5).  Delphi  then  runs  over  these  transcriptions  in  the  order  they  have  been  ranked,  lirst  with 
the  Semantic  Linker  disabled  so  that  only  grammatical  ulteranccs  arc  allowed,  and  if  none 
is  found,  runs  over  them  again  w  ith  the  ,Semaiuic  Linker  enabled, 
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3.8  Results  Of  Formal  Evaluation  On  AXIS 


Our  complcic  system  including  the  Semantic  Liiikc-i  was  evaluated  in  the  December  1993 
ARPA  ATIS  evaluation.  Prior  to  evaluation,  ATIS  versions  of  the  system's  domain  model, 
‘lexicon  and  realization  rules  had  been  developed  using  .several  thousand  utterances  of 
■  training  data  collected  from  users  of  ATIS.  An  approximately  1000-utterancc  set  was  held 
aside  as  a  blind  test  set  on  which  all  nailicipating  sites  were  c'  ilualed. 

Error  rate  in  this  evaluation  svas  defined  as  I'+NA,  where  F  was  the  percentage  of  queries 
answered  incorrectly,  and  NA  the  percentage  of  queries  not  answered  at  all.  There  were 
two  evaluations  on  the  same  corpus  using  this  metric:  one  of  Nf.  text  understanding  alone, 
and  the  other  of  a  complete  spoken  language  system  tSLS)  eomi)ri,sed  of  Delphi  and  the 
Byblos  recognizer,  Our  system  achieved  an  official  result  of  14.7%  on  the  NL  test,  which 
was  the  third-lowcst  error  rate  achieved,  fhe  SLS  error  rate  was  17.5%. 

Our  own  cxperinients  show  that  using  the  .Semantic  Linker  reduced  our  system's  error  rate 
on  the  NL  test  by  43ff .  This  vvas  largely  achieved  by  dramatically  lowering  the  no-answer 
rate  NA  frotii  18.7%  to  2.39'('.  Just  over  80%  of  this  increment  of  sentences  answered  were 
answered  correctly,  so  the  Linker  showed  considerable  accuracy. 


3.9  Porting  Delphi  to  the  SPiJNT  Domain 


Although  the  language  understanding  technology  that  is  Delphi  has  greatly  improved  (as 
evidenced  by  objective  ARPA  evaluation)  and  has  been  incorporated  into  real-time 
demonstrations,  there  has  been  comparatively  little  effort  to  make  systems  truly 
transferable  to  various  types  of  application  systems  and  domains,  and  to  develop  and 
optimize  human-machine  interface  paradigms  for  dual-use  applications. 

Transferability,  also  called  portability,  is  key  to  the  development  of  robust,  practical,  usable 
systems  and,  thus,  to  making  a  wide  variety  of  applications  truly  practical,  Portability  has 
long  been  a  goal  ([17,  16,  34])  but  seldom  has  been  achieveo. 

Delphi  has  been  developed  under  the  premise  that  as  much  general  linguistic  knowledge  as 
possible  should  be  built  into  the  system  in  a  domain-independent  way,  modularizing  the 
domain-dependent  information  to  knowledge  ba.ses  and  easily  replaceable  components.  By 
making  appropriate  use  of  general  linguistic  patterns,  Delphi  enormously  reduces  the 
amount  and  complexity  of  knowledge  needed  to  install  a  new  domain. 
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riio  SPl  IN  r  I  speech  iind  Language  Imegr.iiionl  doiiiuin  was  made  available  under  a 
separaie  eoniraei  with  Rome  Lalxrraiory.  Development  of  portable  modules  and  tools  to 
assist  the  porting  process  were  done  under  'his  contract;  the  actual  porting  was  carried  out 
under  the  Rome  contract. 

the  SPLINT  domain  is  concerned  with  Air  Force  units  and  their  component  aircraft, 
■weaponry  and  other  physical  attributes  of  aircraft,  ordnance,  and  facilities  (such  as  air 
bases,  runways,  bunkers,  etc.).  It  may  be  cemsidered  a  resource  management  domain,  vvith 
a  relational  database  at  its  ticari.  fhe  SPLINT  database  has  106  liclds  in  2.4  tables, 

liy  studying  the  database,  developers  were  .ibic  to  create  an  initial  corpus  of  sample 
questions  that  might  he  asked  about  the  dat.i  Some  additional  queries  were  piovided  by 
Rome  Laboratory.  The  original  set  of  questiotis  was  augmented  by  including  variations  iluii 
differed  by  substituting  dilTeient  words  of  the  same  type  ifor  example,  different  missile 
natnes).  In  this  way,  an  initial  text  corpus  of  ntoro  than  9t)0t)  sentences  was  created. 

Some  example  utterances  in  the  SPldNT  domain  arc; 


Whal  i.s  the  ntiif’c  of  the  AGM-65C  iiuivorU  k  missile 
How  many  airforce  bases  are  itieiv  in  the  US  Military  Area 
Whal  are  they 

Where  are  the  units  with  aircraft  that  carry  vulcan  tail  cannons  stationed 

Show  me  a  map  of  Griffiss 

Which  runways  there  are  operational 

What's  the  length  of  the  longest  runway 

Sort  the  air  to  air  missiles  by  range 


This  corpus  was  abstracted  into  query  schema,  so  that  wc  could  more  easily  examine  it  for 
completeness: 


(AIUS  THB  <DESCRZPTION-PL>  AT 

(AFB-DESCRIPTION}  OPERATIONAL) 

{AFB-DESCRIPTXON}  - 
<A£'B-NAME> 

<AFB-NAME>  AIR\''FORCE\"BASE 
<AFB-NA11E>  AFB 
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<AFB-NAME>  * 

GRIFFISS 

14ANGLEY 

Tlic  purpose  of  the  query  .schema  was  NOT  to  provide  a  complete  represcntatioit  of  all  the 
■questions  that  could  be  asked  of  the  system,  but  rather  to  make  it  easier  to  be  sure  that  the 
training  set  contained  a  wide  range  of  ro-'iesontativo  queries  that  contained  an  appropriate 
distribution  of  entities. 

In  order  to  port  Delphi  to  the  .SPLINT  domain.  SPLlNT-speciiic  versions  of  the  domain 
model,  K'sieon,  realization  rules  and  db-ma|)ping  rules  were  needed,  l-'or  the 
speech-understanding  part  of  the  application,  word  pronunciations  were  also  necessary,  ..■> 
well  us  vvord-ela.ss  membersl  'p  for  a  statistical  n-gram  class  graminar.  Delphi  includes 
"core"  versions  of  some  of  these  knowledge  bases;  a  core  domain  model  with  common 
classes  like  NUMBER  atid  TIME-OE-DAY  and  relations  like  GREATER  a  core  lexicon 
with  closed-class  items  such  as  prepositions  as  well  as  words  appropriate  to 
question-answering  in  general  such  as  "show",  to  which  domain-speeilic  items  have  to  be 
added. 

In  porting  to  SPLINT,  60  classes  and  65  relations  were  added  to  the  domain  model.  400 
words  were  added  to  the  lexicon.  0(  these,  approximately  half  were  derived  from  database 
field  values.  1 18  realization  rules  were  added. 

The  domain  model  was  built  by  a  combination  of  hotioni-up  tdaiabase-structurc  driven) 
and  top-down  (corpus  driven)  techniques.  The  initial  corpus  was  annotated  for  surface 
meaning  using  a  variant  of  the  notation  being  dcvclo|}ed  by  the  ARPA  community  for 
semantic  evaluation.  This  make  it  possible  to  determine  the  set  of  concepts  and  relations 
corresponding  to  the  linguistic  expressions  represented  in  the  corpus. 

The  grammar  did  not  need  to  be  modified,  with  the  exception  of  adding  one  rule  (for 
constructions  such  as  "Mach  1"), 

The  entire  process  took  about  a  person  month  to  get  90%  coverage  on  a  1 400  sentence 
corpus,  developed  independently  by  a  non-NL  person.  An  additional  person  week  was 
required  to  develop  the  speech-related  knowledge  bases.  A  complete  spoken  language 
system  with  Delphi  as  the  understanding  component,  plus  a  Motif-based  user  interface,  was 
successfully  demonstrated  at  the  1994  ARPA  Human  Language  Technology  meeting,  and  at 
Rome  Labs  in  New  York.  The  porting  process  is  described  in  more  detail  in  [14,  6]. 

This  effort  demonstrates  that,  given  an  approprial  j  system  design,  it  is  possible  to  build  a 
complete  spoken  language  system  that  is  robust  to  speech  and  production  errors,  and  to  do 
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so  r.i|:nll\  ,iiKi  '.iraiyliirolwardly. 


3.10  Conclusion  And  Summary 


In  conclasion,  \vc  have  devolopeiJ  a  leclinology  ilial  makes  maximal  use  of  geiieial 
linguisiie  kiunvledge  (o  improve  poiiahilily.  while  ai  the  same  lime  mainiainiiig  rohusiness 
111  ilie  lace  of  ihe  type  of  iniiui  one  can  expect  I'lom  a  real-lilc  spoken  language  application. 
The  system  has  heen  sliosvn  lo  leaeh  high  levels  ol'  perl'ormaiKe  in  objective  blind-tesi 
evaluation  on  the  ATIS  domain.  The  system  lias  also  been  sIiovmi  to  he  lajiidly  portable  to 
a  new  domain,  SPLINT.  This  did  not  require  any  changes  in  the  underlying  system  eoile, 
and  was  done  wiih  a  relatively  small  olTort. 

This  work  shows  that  eoiiiputational  linguistic  methods,  based  on  general  knov  ledge  of 
language,  can  be  used  in  large,  robust  spoken  language  systems,  and  that  siiecial-purpose 
NL  understanding  systems  do  noi  have  to  be  built  for  each  new  task. 


Chapter  4 


The  Semantic  Linker 


4.1  Introduction 


This  chapler  prcscnls  ihc  Scmaniic  Linker,  a  new  mechanism  ILr  understundiiig  ill-foiined 
input.  The  Linker  is  the  domain-iiidepeiidem  fallback  understanding  coinponcnl  used  by 
the  natural  language  component  of  our  spoken  language  system.  The  Semantic  Linker  is 
invoked  wlicn  our  regular  parser  is  unable  to  parse  an  input;  it  produces  a  semantic 
interpretation  by  combining  the  fragmentary  sub-parses  left  over  in  the  chart  using  an 
A*-stylc  search  algorithm  driven  by  empirically  determined  probabilities  and  parameter 
weights,  The  Semantic  Linker  also  provides  u  novel  method  of  handling  ellipsis,  The 
Semantic  Linker  was  used  in  the  ARPA  December  1993  'TIS  evaluation,  where  it  reduced 
our  system’s  error  rate  on  the  NL  lest  by  43%  -  from  31.1%  to  17,8%.  This  was  one  of  tlie 
lowest  error  rates  of  any  system  tested. 

All  important  problem  for  natural  language  interfaces  is  coping  with  input  which  cannot  lx: 
liandlcu  by  the  system's  grammar.  A  system  which  depends  on  its  input  being  grammatical 
(or  on  lying  within  the  coverage  of  its  grammar)  simply  will  not  be  robust  and  useful. 
Some  sort  of  “fallback”  component  is  therefore  necessary  as  a  complement  to  regular 
parsing. 

This  chapter  presents  the  Semantic  Linker,  the  fallback  component  used  by  the  natural 
language  component  of  our  spoken  language  system.  The  Semantic  Linker  is  invoked  wlicn 
our  regular  chart-based  unification  grammar  parser  is  unable  to  parse  an  input;  it  attempts  to 
come  up  with  a  semantic  interpretation  by  combining  the  fragmentary  sub-parses  left  over 
in  the  chart.  The  Semantic  Linker  was  used  in  the  ARPA  December  1993  ATIS  evaluation. 
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uhcrc  II  roiliRwl  mir  ,s\sti;iii's  crmr  r.iic  on  ihc  Ni  .'m  b\  il'n  ni  .M.l'/i  lo  l7.S7r). 

Our  work  is  moiivaied  by  the  observation  tliat  syni.ix.  b)  ilscii',  is  only  useful  insofar  as  it 
helps  tis  in  coming  up  with  a  semantic  interpretation.  Unlike  such  proposals  us  [421,  123| 
or  172],  we  do  not  attempt  to  ‘‘(ix"  the  parsing  process  or  reconstruct  a  parse  tree  from 
fragments,  but  instead  try  to  directly  produce  at  intermediate  predicate-argument  structure 
we  call  a  "semantic  graph".  A  set  of  fragments  corresponds  to  a  disconnected  set  of 
semantic  graphs,  and  the  task  of  interpretation  is  to  lind  a  set  of  links  -  binary  relations  - 
ihul  connect  Ihc.sc  disconnected  semantic  graphs  mto  a  single  connected  one.  limpirically 
determined  piribabilities  and  paramcicr  weights  are  iivcd  in  a  A ’'■style  bcsi-lirst  search  lo 
lind  the  most  plausible  set  of  connections. 

The  Semantie  Linker  also  differs  crucially  from  proposals  such  as  1.36]  in  that  it  does  not 
rely  on  task-model  templates  to  sttive  tire  ill-formodness  problem,  ll  is  therefore 
completely  doinain-iiulependenl  and  cati  be  u.sed  for  atiy  task.  In  addition,  the  Semamic 
Linker  provides  a  novel  method  of  ellipsis  resoltiiion  which  is  integrated  with  fallback 
understanding. 

We  devote  the  ttext  section.  .Seelion  2,  to  a  tnore  detailed  description  of  semantic 
interpretation  and  fragment-generatitm.  Section  3  will  discuss  how  the  space  of  all  possible 
links  and  associated  probabilities  arc  generated.  Section  4  shows  how  we  efiicicntly  search 
this  space  to  produce  a  connected  semantic  graph  interpretation.  Section  5  discus.scs 
subsctiuenl  processing,  and  Section  6  presents  formal  evaluation  results  and  our 
conclusions  and  future  plans. 


4.2  Generating  and  Interpreting  Fragments 


The  semantic  interpretation  of  any  con.siituent  in  our  system,  whether  a  fragment  or  u 
complete  .sentence,  is  represented  by  u  “semantic  graph",  in  which  the  nodes  are  the 
semantic  interpretations  of  lexical  heads  and  the  links  arc  the  seman'  e  relations  between 
them.  For  example,  "Delta  (lies  a  747  to  Denver"  is  represented  by  the  semantic  graph; 

/ - AIRCRAFT-OF  ->  747 

FLY - AIRLINE-OF  ->  DELTA 

\ - DEST-OF  ->  DENVER: TO 

A  PP,  such  "to  Denver"  in  this  example,  is  represented  by  the  semantic  graph 
representation  of  its  NP  object,  tagged  by  the  preposition  of  the  PP.  Quaniilicrs  arc  left  in 
place  to  be  pulled  out  and  scoped  by  later  processing. 
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The  '.cnuiniic  graph  l\)r  an  uucranco  is  incrciiiema..  eompiiieJ  I'roin  the  syniueiie  analysis 
of  ilie  latoianee  using  a  system  of  "reali/atioii  riik"-  trliieh  map  the  giammaiieal  relaiion 
an  arguiiicnt  bears  to  the  head  onto  the  sematitic  relation  ilie  iiuerpreiution  of  the  argnmeni 
bears  to  tlic  interpretation  of  the  head.  There  arc  grammatical  relations  corresponding  to 
Ihc  familiar  notions  of  .subject  and  direct-object  etc.,  as  well  as  to  prepositior.-tags  like  “to" 
.above.  In  our  example,  the  "TO"  grammatical  relation  holds  between  "flics"  and  "Denver". 
It  is  mairped  to  the  DEST-OF  .semantic  relation  by  the  realization  rule: 


TO  ->  DEST-OF 


where  ilv  semantic  type  requirements  on  head  aiu'  I’l’  object  are  implicit  from  the 
delinition  of  the  relation.  A  different  set  of  rcaliz.ition  rules  is  used  for  each  domain. 

When  a  complete  parse  of  an  utterance  cannot  be  perfortned,  we  are  left  with  a  set  of 
fragmentary  analyses  in  the  chart  which  correspond  to  syntactically  well-formed  and 
semantically  coherent  portions  of  the  input  string.  The  fragment-generation  stage  of  the 
Linker  extracts  the  most  plausible  fragment  sub-parses  associated  with  the  longest 
sub-strings  of  the  input,  using  probabilities  associated  with  the  grammar  rules  120).  It  uses 
a  "greedy"  algorithm  which  first  choo.scs  from  the  chart  the  coherent  fragment  spanning  the 
longest  sub-string  of  input,  then  the  coherent  fragment  spanning  .he  longest  sub-string 
disjoint  from  the  first  sub-string,  and  so  on,  until  a  set  of  fragments  has  bcci.  generated  that 
spans  the  entire  input  string. 

Each  fragment  has  an  associated  semantic  graph  as  its  interpretation.  Suppose  for  example, 
we  have  tbc  three  fragments  "to  Boston",  "Denver"  and  "D  'ta  flights  on  Monday".  Then 
the  three,  conesponding  sub-graphs  arc: 


POSTON  I TO 
DENVER 

FLIGHTSI - AIRIjINE-OF  ->  DELTA 

\ -  DAY-OF-WK  ->  MONDAY: ON 


This  set  of  n  sub-graphs  can  be  turned  into  a  complete  connected  graph,  and  therefore  a 
complete  interpretation  of  the  utterance,  if  we  can  find  a  set  of  n  -  1  new  links  between 
nodes  in  different  sub-graphs. 
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4.3  Computing  the  Possible  Links  and  Their  Probabilities 

As  a  lirst  step  in  linding  the  bcsi  sci  of  coiinaciin};  links,  the  Semantic  Linker  computes  the 
.“link-group  list",  which  has  one  element,  or  “link  group”,  for  each  pair  of  fragments.  The 
link-group  for  a  pair  of  fragments  is  simply  the  set  of  links  that  could  connect  the  two 
fragments,  wlicrc  each  link  connects  an  object  in  one  fragment's  semantic  graph  to  an 
object  in  the  other  fragment's  semantic  graph.  Tite  links  are  computed  using  the  same  set 
of  reali/aiion  rules  that  drive  the  parser  and  semantic  imcrpreicr.  They  depend  on  the 
semanlic  typos  of  the  two  objocis  and  on  the  prep.-Miion  lag  iif  any  )  of  the  second  object. 
This  tag  can  be  rcla.sed  or  assumed  with  a  penaliv  ,is  we  shall  see  belovv.  Lor  the  set  of 
fragments  in  our  example  ihc  link-group  list  is: 

la.  FLIGHTSl -  DEST-OF  ->  BOSTON! TO 

lb.  FLIQHT31---  ORIG-OF  ->  BOSTON: TO 

2a.  FLIGHTSl---  DEST-OF  ->  DENVER 

2b.  FLIGHTSl---  ORIG-OF  ->  DEaiVER 

3a.  DENVER -  NEARBY-TO  ->  BOSTON: TO 

where  the  links  are  grouped  together  in  a  ordered  list  according  to  the  fragmcni-pairs  they 
connect.  Since  there  arc  three  fragments  in  this  example,  there  arc  three  pairs  and  thus 
three  groups. 

Each  link  has  a  set  of  features  which  are  computed  along  with  the  link.  The  most 
important  is  the  relational  probability  of  the  link,  or: 

P(R,C1,C2) 

where  R  is  the  semantic  relation  of  the  link  and  Cl  and  C2  are  semantic  classes  of  the  two 
argument  positions,  and  C2  may  be  tagged  by  a  preposition.  This  is  the  probability  that  a 
pair  of  objects  of  type  Cl  and  C2  are  linked  by  by  the  relation  R  in  an  interpretation 
(instead  of  linked  by  some  different  relation  or  not  linked  at  all). 

A  corpus  of  interpretations  generated  by  hand  could  be  u.sed  to  determine  these 
probabilities,  but  we  instead  use  a  corpus  of  3000  semantic  graph  interpretations  of 
sentences  that  our  regular  parser  is  able  to  analyze  correctly. 
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I'roni  lhl^  corpii>.,  wc  can  dcierniiiic  tliai  the  link  la  has  a  high  (.89)  pmbabilily  of 
conncciiMi;  a  l-LIGHT  and  Cri  'iMO  ohjecl.  '.Uicrcas  ihc  link  .4a  has  a  near  zero  probahiliiy, 
since  ihc  iclaiion  NEAR13Y-CITY-01' occurs  very  imroiiucmly  teiween  two  cities. 

Links  can  have  other  features  depending  on  assuii'ption.s  made  in  computing  them.  For 
example,  a  link  can  be  computed  by  ignoring  the  picposiiional  tag  of  the  second  object,  in 
which  ca.se  the  link  is  given  the  feature  “IGNORES-I'REP".  An  example  would  he  lb 
above,  which  ignores  the  preposition  "  .\  link  can  also  be  computed  ny  assuming  a 

prepositional  lag  that  is  not  present,  giving  the  link  the  feature  '‘ASSUMF,.S-FREP",  as  in 
.4a,  where  the  preposition  "near''  is  assumed.  .\s  we  shall  see  in  the  next  section,  these 
features  are  also  assigned  negative  weights  as  penalties,  balancing  out  any  higher  relational 
prohabilii\'the  link  may  have  gained  from  the  assumptions  mtide  by  it. 


4.4  Searching  the  Space  of  Combinations 


In  order  to  structure  the  search  so  as  to  avoid  redundant  links  and  duplication  of  search 
states,  we  order  the  link-group  list  arbitrarily.  Levels  of  the  .search  tree  correspond  to 
elements  of  the.  link-group  list.  At  each  level,  search  states  are  expanded  by  generating  a 
new  state  for  each  link  L  in  the  corresponding  link -group.  L  is  added  to  the  links  already 
chosen  by  the  parent  state  to  make  the  links  chosen  by  the  new  .state.  An  additional  “skip” 
state  is  generated,  which  represents  the  choice  not  to  add  any  of  the  links  in  that 
link-group.  Its  links  arc  just  those  of  its  parent  state. 

Delow  is  a  portion  of  the  search  tree  for  the  problem  of  connecting  our  example  .semantic 
graph.  Each  search  state  is  labeled  with  the  link  i;  chose,  or  with  "s”  indicating  the  skip 
■  state  where  no  link  was  chosen: 


/-  ■ 

a —  2a--  3a 
/  \-  2b--  3a 
/  /--  S--  3a 

START- -la--  2a 
\  \—  2b 

\  /--  a  --  3a 
lb--  2a 
\--  2b 


If  a  state’s  chosen  links  connect  all  the  fragments,  the  state  is  said  to  be  “complete”,  and 
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11(1  nioi'L'  (.'Npaiisioii  IS  (Jane  on  it.  Conipleic  suites  'ii  ilic  tree  above  iiielinJe  <a,s,3a>, 

<  la,2a>,  <  |h,2a>  etc.  If  the  slate  runs  oiii  of  link -groups  to  try.  (lie  stale  is  saiiJ  to  be 
“eJead",  and  no  more  expansion  is  done  on  it.  A  dead  stale  above  is  <s,s>. 

An  iiiiportant  coiisiraint  on  stales  is  that  they  not  include  eoniradielory  links.  If  a  link  of 
the  I'orni  (R  A  B'),  where  R  is  a  single-valued  rciaiion,  is  added  to  a  stale  tiial  already 
includes  a  link  of  the  form  (R  A  B),  the  clash  between  B  and  B'  must  resolved  by 
aiiempiing  to  unify  the  two  sub-graphs  rooted  at  these  nodes.  It  unilieaiion  fails,  the  state 
is  classified  as  ineoliercnt,  and  not  expanded  further,  lixaniples  of  incoherent  states  in  the 
space  above  are  la,2a>  and  <lb,2b>.  /k  '■sueccss''  state  is  a  conipleio.  coherent  state. 
Some  e.vamplcs  .ire  <s2a..3a>.  <la.2b>  and  <lb.2.i>. 

bach  stale  is  associated  with  a  score,  or  "cost'’,  which  is  the  sum  of  the  log-probabilities  of 
its  chosen  links,  plus  penalties  for  any  other  feature  -  the  stale  may  have,  |ilus  an  eslimaitd 
cost  for  each  link,  if  any,  iliat  still  has  to  be  added  to  eonneel  the  graph.  This  score  is  used 
to  guide  an  A*-siylc  besl-lirst  search  through  the  space. 

Link  da  has  very  low  probability,  while  link  lb  has  the  IGNORLiS-RRER  feature.  States 
choosing  these  links  will  therefore  have  low  scores.  The  other  states  are  incirherent,  and 
the  search  will  therefore  produce  slate  <la,2b>  as  the  bvsi  state,  with  a  complete,  coherent 
sciriantic  graph  interprclalion, 

For  reasons  of  space,  we  have  used  a  very  simple  example  here.  Below  arc  actual 
sentences  of  the  formal  le.sl  evaluation  which  the  Linker  handled  cortectly: 

How  much  is  a  coach  jU^^hi  the  cheapest  coach  flight  on  Soaihest  Airlines 

Phoenix  till  to  Milwaukee  on  Siiiulay 

Los  Angeles  to  Pittsburgh  afternoon  Tuesday 

List  flights  from  Orlando  to  Tacoma  on  Saturday  of  fare  basis  code  of  Q 
What  airline  is  A  S  as  in  Sam 

Instead  of  Saint  Louis  how  about  a  plane  that  stops  in  Denver 


4.4.1  Handling  Corrections  and  Ellipsis 


A  common  problem  faced  by  ordinary  par.-.ers  is  speaker  disduency: 


Tell  me  the  flights  to  Denver  uhh  to  Boston 
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Tins  will  piodiico  llic  riiij>nicms  "Tell  rne  iho  llieliK  lo  Deinei"  and  “to  Boston''.  Since  a 
lliglil  can  hate  onl>  one  DBST-OT  th.'  Iiagmeni  "to  Hostoii"  cannot  be  linked  aceoiding  to 
its  most  salient  DliST-OI'  interpretation.  The  alternative  would  be  to  ignore  the  "to" 
preposition  and  attempt  to  link  “Boston"  as  an  ORlCi-OB  with  the  IGNORli-l’REB  I'caiurc. 

This  clearly  would  not  itroduce  the  correct  inicrprctalion,  however.  The  Linker  provides  an 
alternative  when  the  clashing  value  is  to  the  right  of  the  existing  '  uluc  in  the  siriitg.  In  this 
ca.se,  graph  unification  operates  in  an  asymmetrical  overwrite  mode,  in  which  values  to  the 
right  in  the  string  replace  values  to  the  left.  The  resulting  state  leeeives  the  combinational 
feature  REPLACEMENT,  sehich  is  not  penali/ed  strongly.  If  the  relational  probability  of 
the  DE.ST-OE  link  is  good,  it  ss  ill  defeat  its  KiNORlM’REP  rival,  as  it  slutuki,  In  this 
svay,  we  are  able  to  integrate  some  aspects  id'  the  dislhieney-handling  in  1 1')|  with  the  otlter 
types  of  ill-formediiess  discussed  here. 

Oraph  utiilication  with  overwriting  is  also  used  for  ellipsis-handling.  In  ellipsis-handling 
tttode,  the  Linker  tries  to  combine  the  semantic  graph  of  the  cuirent  uttcrattce  into  the 
semantic  graph  of  the  previous  one.  Tor  e.xample.  cotisider  the  dialogue: 


''What  flights  fly  to  Denver  on  WetSneaday  at  3  pm?'' 

/ - Jj'LlOHT-OF - >  FLIGHT1--  QUANTIFIER  -->  WHAT 

FLYl -  DEST-CITY-OF  ->  DENVER 

\ - TIME-OF - >  T1 - D-O-W  ->  WEDNESDAY 

\ -  T-O-D  --  T3  -->  HOURS  ->  3 

\ - >  AM-PM  ->  PM 

' '«arly  Wadnsflday  morning'' 

/ -  D-O-W  -->  WEDNESDAY 

T4  /--  HOURS  ->  8 

y -  T-O-D  -->  TS  --  BEFORE  TV 

AM-PM  ->  AM 


The  most  plausible  link  between  the  two  graphs  is  a  TIME-OF  bclwecti  FLY  1  and  T4,  but 
FL,Y1  already  has  a  TIME-OF  link  lo  Tl,  and  so  T1  and  T4  must  be  unilied.  There  is  no 
clash  between  the  D-O-W  links,  but  the  graph  unification  routine,  which  is  extended  to  do 
reasoning  about  certain  relations  such  as  “BEFORE"  above,  delects  the  clash  between  the 
T-O-D  links,  and  and  so  the  T-O-D  link  of  the  lirst  graph  is  replaced  with  that  of  the 
second. 
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N\ne  ih.ii  coiucniioiwl  uiiproacli  u>  L’lhpsix  lliai  i  ■  '\»scil  on  m.iiL'Iiing  syntactic  stiiicturcs, 
sLicIi  a^  |75|,  uotiM  have  ililTiculty  here,  sineo  iheie  is  little  syntactic  parullelisiii  in  ihi.s 
example.  The  advantage  of  our  approach  is  its  ability  to  detect  and  rei)lacc  just  those 
components  ot  structure  which  clash  on  a  semantic  level. 


4.4.2  Hallucination 


.Sup|)ose  that  ue  hit\e  a  more  telegrtiphic  iiiierance  tlial  does  not  include  the  uord  "llights": 


lldstnii  Id  Dcnvi'i  dll  A/o/iiAn  Dctui 

This  utterance  geneiittes  the  liagmcnis  "noston".  "to  Denser",  "on  Monday"  and  "Della". 
Clearly,  no  complete  set  of  links  can  he  generated  which  would  I'ully  connect  this  set, 
without  inirodueing  an  object  of  some  other  semanlic  class  such  as  IT.ICIHT  l.i  act  its  a 
"huh"  between  them, 

To  handle  these  situations,  the  Semantic  Linker  is  able  to  “hallucinate"  objects  of  certain 
semanlic  classes,  and  add  link-groups  between  iliai  hallucinaied  object  and  the  fragments 
which  arc  explicitly  present.  The  list  of  such  semantic  clas.sc.'.  is  a  domain-dependent 
parameter  of  the  Linker,  and  in  the  ATIS  domain  comprises  just  the  classes  FLIGHT  and 
GROUND-TRANSFORTATION. 

A  link  'o  a  hallucinaied  object  carries  a  substantial  penalty,  as  it  introduces  into  the 
di.seour.se  an  object  tor  which  there  may  be  only  indireet  evidence.  The  search  ordinarily 
bypasses  hallucinaied  objects,  unletts  alternatives  are  worse  or  unavailable. 

Note  that  a  reconstructive  parsing  approach,  such  as  typilied  by  [23]  or  [721,  would 
potentially  have  difficulty  with  this  example,  as  there  is  not  even  a  fragment  which  could 
plausibly  act  as  the  syntactic  head. 


4.5  After  Combination  -  Generating  the  Logical  Form 


After  the  combination  phase  is  complete,  we  have  zero  or  more  success  states  from  which 
to  generate  the  utterance  interpretation.  If  there  is  more  than  one  success  state,  the  Linker 
simply  picks  the  the  subset  of  them  with  the  highe.st  score.  In  the  ca.se  that  no  success 
states  are  found,  an  interpretation  may  still  be  generated  by  “.scavenging"  through  the  state 
space  for  the  best  partial  connection  states  found  in  the  cour.se  of  search. 
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Once  .1  coniplcic  scmamic  giapli  has  hocii  produc.  he  Linker  nuisi  still  decide  which  ol' 
the  noniin.il  seni.iniics  objects  are  to  be  displayed  saiisl'y  the  user’s  request  -  what  we 
term  the  “topic"  of  the  utterance.  Various  heuristic  -  are  used,  including  whether  the 
quantilier  of  the  nominal  in  WM,  whether  it  occurs  as  an  argument  to  a  display  verb  like 
“show",  and  whether  it  signilies  a  nun-triviul  consii.unt  on  other  nominals. 


4.6  Results  and  Discussion 

Our  complete  system  incliirling  the  Semantic  Link.:  uas  esaliiaicd  in  the  December  I'W.^ 
ARl’A  evaluation.  Lrror  rale  in  this  evaltiaiion  w  o  delincd  as  h'+NA,  where  I-  was  the 
percentage  of  queries  answered  incorrectly,  and  NA  the  percentage  of  qiicncs  not  answered 
at  all.  I’reliminary  results  indicate  that  our  system  icceiseu  an  error  rale  of  IT.KCh  on  the 
NL  test,  which  was  one  of  the  lowest  error  rates  avhievcd  by  any  of  the  paricipaiing 
systems. 

Our  experinieitis  show  that  using  the  Semantic  Linker  reduced  our  system's  error  rale  on 
the  NL  lest  by  *13%  (from  31,1'.}  to  17.8''/!-)-  The  no-answer  rate  NA  was  dramatically 
reduced  from  I8,7‘’/o  to  2.3';'}.  Just  over  80'^  of  this  increment  of  sentences  answered  were 
an.swcrcd  correctly,  so  the  Linker  showed  considerable  accuracy. 

In  conclusion,  we  have  presented  a  new  fallback  understanding  system  that  works  with 
semantic  representations  directly  instead  of  with  syntactic  structure  or  task  templates.  We 
have  also  prcsciiied  a  new  way  to  do  ellipsis  rc.soluiion  with  this  component.  Furthermore, 
this  system  has  been  proven  in  formal  tests  to  draniaticailv  improve  overall  performance. 

Several  areas  of  future  work  arc  .seen.  One  is  the  use  of  automatic  training  methods  to 
determine  feature  weights.  A  second  area  of  future  work  is  the  use  of  relational 
P'obabilities  and  search  in  the  generation  of  fragments  themselves.  A  third  aiid  last  area  of 
future  work  is  to  more  fully  integrate  the  Scmamic  Linker  into  the  regular  parsing 
mechanism  itself,  and  to  investigate  ways  in  which  parsing  can  be  viewed  as  similar  to  the 
linking  process. 


Chapter  5 


Written  Language  Training  for  Spoken 
Language  Modeling 


5.1  Introduction 


We  aitcmpted  lo  improve  recogniiion  accuracy  by  reducing  the  inadequacies  of  ihc  lexicon 
and  language  model,  Specilically  we  address  (he  following  three  problems: 

(1)  the  best  size  for  the  lexicon, 

(2)  conditioning  written  text  for  spoken  language  recognition,  and 

(3)  using  additional  training  outside  the  text  distribution. 

We  found  that  increasing  the  lexicon  20,000  words  to  40,000  words  reduced  the  percentage 
of  words  outside  the  vocabulary  from  over  2%  to  just  0.2%,  thereby  decreasing  the  error 
rate  substantially.  The  error  rate  on  words  already  in  the  vocabulary  did  not  increase 
substantially.  We  modified  the  language  model  training  text  by  applying  rules  to  simulate 
the  differences  between  the  training  text  and  what  people  actually  said.  Finally,  we  found 
that  using  another  three  years’  of  training  text  -  even  without  the  appropriate  preprocessing, 
substantially  improved  the  language  model.  We  akso  tested  these  approaches  on 
spontaneous  news  dictation  and  found  similar  improvements. 

Speech  recognition  accuracy  is  affected  as  much  by  the  language  model  as  by  the  acoustic 
model.  In  general,  the  word  error  rate  is  roughly  proportional  to  the  square  root  of  the 
perplexity  of  the  language  model.  In  addition,  in  a  natural  unlimited  vocabulary  task,  a 
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siihstaiuuil  poiiion  (if  ;lic  word  errors  come  I'rom  u  •  is  dial  are  not  even  in  ihc  recogiiiiioii 
vueabiilarv.  These  oui-or-\ocabulary  (OOV)  words  i.i^e  no  cbanee  of  being  recognized 
correcily.  Thus,  our  goal  is  to  estimate  a  good  langu.igc  model  from  the  available  training 
text,  and  to  determine  a  vocabulary  that  is  likely  to  cover  the  test  vocabulary. 

The  straightforward  solution  to  improving  the  language  model  might  be  to  increase  the 
complexity  of  the  model  (c.g.,  u.se  a  higher  order  Markov  chain)  andyor  obtain  more 
language  model  training  text.  But  this  by  itself  will  not  necessarily  provide  a  belter  model, 
especially  if  the  text  is  not  an  ideal  model  of  what  people  will  actually  say.  The  simple 
solution  to  increase  the  coverage  of  the  vocahulars  to  incretise  the  voc.ibulary  size.  But 
this  also  increases  the  word  error  rate  titid  the  com|".r.uiun  timl  size  of  the  recognition 
process. 

In  this  chapter  sve  consider  seseral  simple  technniucs  lor  improsing  the  |H)wer  of  the 
language  model.  First,  in  Section  3,  we  '“xplore  the  effect  of  increasing  the  \ocabulary  size 
on  recognition  accuracy  in  an  unlimited  vocabula. y  i,i'.k.  Second,  in  Section  4,  wc  consider 
ways  to  model  the  differences  between  the  language  model  training  text  and  the  way 
people  actually  speak.  And  third,  in  Section  5,  we  show  that  simply  increasing  the  amount 
of  language  model  training  helps  significantly. 


5.2  The  WSJ  Corpus 


The  November  1993  ARPA  Continuous  Speech  Recognition  (CSR)  evaluations  was  based 
on  speech  and  language  taken  from  the  Wall  Street  Journal  .VSJ).  The  standard  language 
model  training  text  was  estimated  from  about  35  million  words  of  text  extracted  from  the 
WSJ  from  1987  to  1989.  The  text  was  normalized  (preprocessed)  with  a  model  for  what 
words  people  use  to  read  open  text.  For  example,  “S234.56"  was  always  assumed  to  be 
read  as  "two  hundred  thirty  four  dollars  and  fifty  six  cents”.  “March  13"  was  always 
normalized  as  "March  thirteenth"  -  not  "March  the  thirteenth”,  nor  “March  thirteen”.  And 
so  on. 

The  original  processed  text  contains  about  160,(X)0  unique  words.  However,  many  of  these 
are  due  to  misspellings.  Therefore,  the  test  corpus  was  limited  to  those  sentences  that 
consisted  only  of  the  most  likely  64,000  words.  While  this  vocabulary  is  still  quite  large,  it 
has  two  beneficial  effects.  First,  it  greatly  reduces  the  number  of  misspellings  in  the  texts. 
Second,  it  allows  implementatirns  to  use  2-byte  data  fields  to  represent  the  words  rather 
than  having  to  use  4  bytes. 

The  “standard"  recognition  vocabulary  was  dcnucd  as  the  most  l.’kely  20,000  words  in  the 
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corpus.  Then,  ihc  siandard  hiiigudpc  model  wa.s  dcimod  as  a  inprani  language  model 
eslimaled  speeilically  for  these  20K  words.  This  siaiidard  m  idel,  provided  by  Lincoln 
Laboratory,  was  to  be  used  for  the  controlled  portion  of  the  recognition  tests.  In  addition, 
partieipants  were  encouraged  to  generate  an  improved  language  model  by  any  means  (other 
titan  c.vamining  the  test  data). 


5.3  Recognition  Lexicon 


VVe  find  thitt.  tyjtically.  over  2'!  of  the  word  occiiirenccs  in  a  development  sot  are  not 
included  in  the  standard  2()K-word  socabulary.  Naturally,  words  that  are  not  in  the 
vocabulary  cannot  be  recogni/eil  accurately.  (At  best,  we  migbi  try  to  detect  that  there  is 
one  or  more  unknown  words  at  this  point  in  a  sentence,  and  then  attempt  to  recognize  the 
phoneme  sequence,  and  then  guess  a  possible  letter  sequence  for  this  phoneme  sequence. 
LInfortunately.  in  Lnglish,  even  if  we  could  recogm/e  the  phonemes  perfectly,  there  are 
many  valid  ways  to  spell  ti  particular  [ihoncmc  sequence.)  However,  in  addition  to  this 
word  not  being  recomiizx  ’  •  often  see  that  one  or  two  words  adjacent  to  this  missing 
word  arc  also  ini'-'  c  lis  is  because  the  recognition,  in  choosing  a  word  in  its 

vocabulary,  also  kv  mg  context  for  the  following  or  preceding  words.  In 

general,  we  (hui  thiU  .rror  rate  incrca,scs  by  about  1.5  to  2  times  the  number  of 

out-of-vocabu'  .>  OO  s. 

One  simple  way  u  Iccrca.v.  nc  percentage  of  OOV  words  is  to  increase  the  vocabulary 
size.  But  which  words  ,hould  be  added?  The  obvious  solution  is  to  add  words  in  order  of 
their  relative  frequency  within  the  fuM  text  corpus.  There  are  ■■■''veral  problems  that  might 
result  from  this: 


1 .  The  vocabulary  might  have  to  be  extremely  large  before  the  OOV  rate  is  reduced 
significantly. 

2.  If  the  word  error  rate  for  the  vast  majority  of  the  words  that  arc  already  in  the 
smaller  vocabulary  increased  by  even  a  small  amount,  it  might  offset  any  gain 
obtained  from  reducing  the  OOV  rale. 

3.  The  language  model  probabilities  for  these  additional  words  would  be  quite  low, 
which  might  prevent  them  from  being  recognized  anyway. 


We  did  not  have  phonetic  pronunciations  for  all  of  the  64K  words.  We  sent  a  list  of  the 
(approximately  34K)  words  for  which  we  had  no  pronunciations  to  Boston  University. 
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They  I'lHiiul  pfoininciaiions  Tor  ahom  hall'  i  ISK)  of  iho  vvoals  in  ilicir  (cxpandctl  Mohy) 
diciioiiars .  VN'hcii  wi;  added  liicsc  words  lo  mir  WSJ  diciionuiy,  we  liad  a  loial  of  50K 
words  lira!  we  could  use  for  recogiiiiion. 

The  following  table  shows  the  percentage  of  OOV  words  as  a  function  of  the  vocabulary 
size.  The  mcasurenicm  was  done  on  the  WSJl  Hubl  “20K''  development  test  which  has 
■2,464  unique  words  with  the  total  count  of  8.227  words.  Due  to  the  unavailability  of 
phonetic  pronunciations  (mentioned  above),  the  final  vocabular)  size  would  be  the  second 
column. 


fop  N  i 

Vocab. 

lioov 

,  ^ 

20k 

19998 

!  ■  187 

:  s  27  ’ 

30k 

28247 

8.3 

!  1.03 

4()k  • 

.98 

i  39 

1  0.47 

48k  ; 

402 1 3 

1  13 

1  0.17 

fiOk  ' 

41363 

j  13 

1  0.1.3 

()4k  ; 

48.386 

1  0.01 

Table  .S.!:  Out  of  vocabulary  words  as  a  function  of  vocabulary  size 


We  were  somewhat  surprised  to  see  that  the  percentage  of  OOV  words  was  reduced  to 
only  0,17%  when  the  lexicon  included  the  most  likely  40K  words  -  especially  given  that 
many  of  the  most  likely  words  were  not  available  because  we  did  not  have  phonetic 
pronunciations  for  them.  Thus,  it  was  not  necessary  to  increase  the  vocabulary  above  4GK 
words.  .. 

The  second  worry  was  that  increasing  the  vocabulary  by  too  much  might  increase  the  word 
error  rate  due  to  the  increased  number  of  choices.  For  example,  normally,  if  we  double  the 
vocabulary,  wc  might  expect  an  increase  in  word  error  rate  of  about  40% !  So  we  performed 
an  experiment  in  which  we  used  the  standard  20K  language  model  for  the  5K  development 
data.  We  found,  to  our  surprise,  that  the  error  rate  increased  only  slightly,  from  8.7%  to 
9.3%.  Therefore,  wc  fell  conlidcni  that  we  could  increase  the  vocabulary  as  needed, 

We  considered  possible  explanations  for  the  small  incrca.se  in  ciror  due  lo  a  larger 
vocabulary.  We  realized  that  the  answer  was  in  the  language  model.  In  the  first  case,  when 
we  just  increase  the  vocabulary,  the  new  words  also  nave  the  same  probability  in  the 
language  model  as  the  old  words.  However,  in  this  case,  all  the  new  words  that  were  added 
had  lower  probabilities  (at  least  for  the  unigram  model)  than  the  existing  words.  Let  us 
consider  two  possibilities  that  we  would  not  falsely  substitute  a  new  word  for  an  old  one. 
If  the  new  word  were  acoustically  similar  to  one  of  the  words  in  the  test  (and  therefore 
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siinihir  lit  ;i  woiii  in  tlie  oii(;mal  vocabulaiy,  tlien  ilio  word  would  be  correctly  recognized 
because  the  origii.’ul  word  would  always  base  u  higher  language  tnodel  probability.  If,  on 
the  other  hand,  the  new  word  were  acou.stically  very  different  from  the  word  being  spoken, 
then  we  might  c.xpect  that  our  acoustic  models  would  prevent  the  new  word  from  being 
cho.sen  over  the  old  word,  While  the  argument  makes  .some  sense,  we  did  not  expect  the 
loss  for  increasing  the  vocabulaiy  from  5K  words  to  20K  words  to  be  so  small. 

rinally,  the  third  question  is  whether  the  new  words  would  be  recognized  when  they  did 
occur,  since  (as  mentioned  above)  their  language  model  probabilities  weie  generally  low. 

In  fact,  we  found  iluii.  n\en  though  the  error  rate  for  ihcse  new  words  was  higher  than  for 
the  more  likely  words,  we  were  still  able  to  recognize  about  to  70'i('  of  them 
eorrecily,  presumably  ba>ed  largely  on  the  acoustic  model.  Thus,  the  net  effect  of  this  was 
to  reduce  the  word  error  rate  by  about  !''■(  to  I..')');,  absolute, 


5.4  Modeling  Spoken  Language 


Another  effect  that  we  worked  on  was  the  difference  bci'’'ccn  the  processed  text,  us  defined 
by  the  preprocessor,  and  the  words  that  people  actually  used  when  reading  WSJ  text.  In 
the  pilot  WSJ  corpus,  the  subjects  were  prompted  with  texts  that  hud  already  been 
“normalized",  so  that  there  was  no  ambiguity  about  how  to  read  a  .sentence.  However,  in 
the  WSJ  I  corpus,  subjects  were  instructed  to  read  the  original  text,s  and  to  say  whatever 
seemed  most  appropriate  to  them.  Since  the  WSJl  prompting  texts  were  not  normalized  to 
deterministic  word  sequences,  subjects  sliowed  considerable  variability  in  their  leading  of 
the  prompting  text. 

However,  tlie  standard  language  model  was  derived  from  the  normalized  text  produced  by 
the  preprocessor.  This  resulted  in  a  mismatch  between  'he  language  model  and  the  actual 
word  sequences  that  were  spoken.  While  the  preprocessor  was  quite  good  at  predicting 
what  people  said  most  of  the  time,  there  were  several  cases  where  people  used  different 
words  than  predicted.  For  example,  the  preprocessor  predicted  that  strings  like  “$234" 
would  be  read  as  “two  hundred  thiity  four  dollars”.  But  in  fact,  most  i)eoplc  read  this  as 
“two  hundred  AND  thirty  four  dollars”.  For  another  extreme  example,  the  preprocessor’s 
prediction  of  “10.4”  was  “ten  point  four",  but  the  subject  (in  the  WSJl  development  data) 
read  this  as  "ten  and  four  tenths”.  There  were  many  other  similar  examples. 

The  standard  model  for  the  tests  was  the  “nonverbalized  punctuation”  (NVP)  model,  which 
assumes  that  the  readers  never  speak  any  of  the  punctuation  words.  The  other  model  that 
bad  been  defined  was  the  “verbalized  punctuation"  (VP)  model,  which  assumed  that  all  of 
•  the  punctuation  was  read  out  loud.  This  year,  the  subjects  were  instructed  that  they  were 
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free  to  iv>ul  ilio  piiiK'iuaiioii  out  loud  or  not.  in  uh.nevcr  w.iy  they  Iccl  most  coml'orlablc. 

It  tul■n^  oLii  that  people  didn't  \eibali/e  most  puiu .uaiion.  llouever,  they  regularly 
verbali/ed  quotation  marks  in  many  dilTereni  ways  that  were  all  differetit  than  tiu;  ways 
predicted  by  the  standard  [xeprocessor. 

-There  were  also  several  words  that  were  read  dil'I'erently  by  subjeets.  For  example,  subjects 
•  pronounced  abbreviations  like,  “CORP."  and  "INC.".  While  the  preprocessor  assumed  that 
ul!  abbreviations  would  be  read  as  full  v-  ords 


We  u.sed  two  methods  to  model  the  ways  people  actually  read  text.  The  simpler  approach 
was  to  ineluile  the  text  of  the  acoustic  traminii  data  in  the  language  model  training.  That  is, 
we  simply  added  the  .■^7K  sentence  transcriptions  from  the  acoustic  training  to  the  2M 
sentences  of  training  text.  The  ativantage  of  this  method  is  that  it  modeled  what  people 
actually  said.  The  system  w  ;  .  delinitely  more  likely  to  recognir.e  words  or  sequences  that 
were  pres  iously  impttssible  The  problem  with  this  method  was  that  the  amount  of 
transcribed  speech  was  quite  small  (about  50  times  smaller)  compared  to  the  original 
training  text.  We  tried  repeating  the  transcriptions  several  times,  but  we  found  that  the 
effect  was  not  its  strong  as  we  would  like. 


A  more  powerful  approach  was  to  simulate  the  effects  of  the  different  word  choices  by 
simple  rules  which  wore  applied  to  all  of  the  35.V1  words  of  language  training  text.  We 
chose  to  u.sc  the  following  rules; 


Preprocessed  Text 
HUNDRED  [number] 
ONE  HUNDRED 
ONE  DOLLAR 
ZERO  POINT  [number] 
AND  ONE  HALF 
AND  ONE  QUARTER 


Simulated  Text 
HUNDRED  AND  jiiumber] 
A  HUNDRED 
A  DOLLAR 
POINT  [number] 

AND  A  HALF 
AND  A  QUARTER 


Thus,  for  example,  if  the  sentence  consists  of  the  pattern  "hundred  twenty",  we  repeated 
the  same  sentence  with  "hundred  AND  twenty”. 


The  result  was  that  about  one  fifth  of  the  sentences  in  llie  original  corpus  had  some  change 
reflecting  a  difference  in  the  way  subjects  read  the  original  text  Thus,  this  was  equivalent 
in  weight  to  an  equal  amount  of  training  text  to  the  original  text. 

We  found  that  this  preprocessing  of  the  text  was  sufficient  to  cover  most  of  those  cases 
where  the  readers  said  things  differently  than  the  predictions.  The  recognition  results 
showed  that  the  system  nov'  usually  recognized  the  new  word  sequences  and  abbreviations 
correctly. 
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5.5  Increasing  the  Language  Model  Training 

While  35M  woi'ds  niuy  seem  like  a  Idi  of  daia.  it  is  not  enouiih  to  cover  all  of  the  trigrains 
that  arc  likely  to  (x;cur  in  the  testing  data.  So  we  considered  other  sources  for  additional 
language  modeling  text.  The  only  easily  accessible  data  available  was  an  additional  3  years 
‘(from  1990-1992)  of  WSJ  data  from  tlte  TIPSTLiR  cinpus  produced  by  the  Linguistic  Data 
Consortium  (LPC), 

Iloweser,  there  vsere  tvso  piobleins  uith  iisinu  this  .l.iia.  First,  since  the  tost  data  was 
known  to  come  fri'in  1987-1989.  we  were  concerned  that  this  might  actually  hurt 
performance  due  to  sonic  dilTerences  in  tlie  topics  ilnring  that  d-year  periwl.  Second,  this 
text  had  mu  been  normalized  with  the  preprocessor  and  wc  did  not  have  available  to  ns  the 
preprocessor  that  was  used  to  transform  the  raw’  test  into  word  soiiuenees. 

We  decided  to  use  the  new  text  xvith  minimal  processing.  The  text  was  liltereil  to  ,emo\e 
all  tables,  captions,  iiuinbers,  etc.  We  replaced  each  initial  example  of  double-quote  (‘‘) 
with  "QUOTH  and  the  matching  token  xviih  "LINQUOTE  or  "HNDQUOTE,  which  w'ere 
the  most  comtnon  ways  these  words  were  .said.  No  other  changes  were  made.  We  just  used 
the  raw  text  as  it  was.  One  benelit  of  this  was  that  abbreviations  were  left  as  they  appeared 
iit  the  text  rather  than  expanded.  Any  numbers,  dates,  dollar  amounts,  etc,  were  just 
considered  “unknown"  words,  and  did  not  contribute  to  the  training.  Wc  assumed  that  we 
had  suflicient  examples  of  numbers  in  the  original  text, 

Wc  found  that  adding  this  additional  language  tiaining  data  reduced  the  error  by  about  17c 
of  the  error,  indicating  that  the  original  35  million  words  was  not  sufficient  for  the  models 
wc  were  using.  Thus,  the  addition  of  plain  text,  even  though  it  was  from  a  different  three 
years,  and  had  many  gaps  due  to  apparent  unknown  words,  still  improvcti  the  recognition 
accuracy  considerably, 


5.6  Results 


The  following  table  shows  the  benefit  of  the  enlarged  40K  lexicon  and  the  enhanced 
language  model  training  on  the  OOV  rate  ard  the  word  error  for  the  development  test  and 
the  evaluation  test. 

Surprisingly,  the  addition  of  three  year’s  LM  training  (from  a  period  post-dating  the  test 
data)  improved  performance  on  the  utterances  that  were  completely  inside  the  vocabulary. 
Evidently,  even  the  common  trigrams  are  poorly  trained  with  only  the  35  million  word 
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'f  OOV 

i  Word  Eiror  | 

Test  Set 

20K 

40K 

20ig 

40K 

Development 

2.27 

0.17 

16.4  ' 

I2.y 

Evaluation 

1,8.3 

0.23 

.  14,2 

1  -  -  1 

12.2 

Tabic  5.2;  Enlarging  tlic  lexicon  improves  OOV  rale  and  error  rale 


WSJO  corpus.  Overall,  our  inodilicuiions  lo  llic  lev. con  and  grammar  iraining  reduced  llie 
word  error  by  l4-22'.'( . 


5.7  Spontaneous  Dictation 


Anollier  area  we  investigaicd  was  sponianeous  dictaiion.  The  subjects  were  primarily 
former  or  praciieing  Journalisis  with  some  experience  ai  dictation.  They  were  instructed  to 
dictate  general  and  linaiicial  news  stories  that  would  l-tc  appropriate  for  a  newspaper  like 
WSJ.  In  general,  the  Journalisis  cho.se  topics  of  recent  'uicrcst.  This  meant  that  the  original 
language  model  was  often  out  of  date  for  the  subject.  As  a  result,  the  iwrcentage  of  OOV 
words  increased  (lo  about  4'7f),  and  the  language  model  taken  from  WSJ  text  was  less 
appropriate. 

The  OOV  words  in  the  sponianeous  data  were  more  likely  to  be  proper  nouns  from  recent 
events  that  were  not  covered  by  the  LM  training  material.  To  counter  this,  we  added  all 
(1,028)  of  the  new  words  that  were  found  in  the  spontaneous  portion  of  the  acoustic 
training  data  in  WSJl.  This  mostly  included  topical  names  (e.g.,  Hillary  Rodham,  NAFTA, 
etc.). 

In  order  to  account  for  some  of  the  differences  between  the  read  text  and  the  spontaneous 
text,  and  to  have  language  model  probabilities  for  the  new  words,  we  added  the  training 
tran.scriptions  of  the  spontaneous  dictation  (about  8K  .sentences)  to  the  LM  training  as  well. 

New  weights  for  the  new  language  model,  HMM,  and  .Segmental  Neural  Network  were  all 
optimized  on  spontaneous  development  test  data.  The  table  below  .,hows  that  the  OOV 
remains  near  1%  even  after  the  enlargement  to  a  4 IK  lexicon. 

As  can  be  seen,  increasing  the  vocabulary  size  from  20K  to  40K  significantly  reduced  the 
OOV  rate.  It  is  important  to  point  out  that  in  this  case,  we  did  not  have  the  benefit  of  a 
word  frequency  list  for  spontaneous  stjeech,  and  that  the  source  of  speech  had  an  unlimited 
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,  OOV 

Word  Error 

Test  Set 

20K 

40K 

4IK 

20K 

41K 

Development 

2.9 

1.4 

0.8 

- 

21.7 

Evaluation 

4.8 

1.9 

1.5 

24.7 

19.1 

Table  5.3:  OOV  rate  and  error  rate  for  41  K  lexicon 


vocabulary.  So  the  reduction  itt  OOV  rate  is  certainK  a  lair  -  if  not  pessimistic  -  estimate 
of  the  real  hencii:  from  increasing  the  vocabulary.  A  iding  the  few  new  words  observed  in 
the  spontaneous  speech  also  helped  somewhat,  but  not  nearly  as  much.  The  sample  of  only 
8,0(X)  sentences  is  clearly  not  sufficient  to  lind  all  the  new  words  that  people  might  use. 
Presumabl>',  if  the  sample  of  spontaneous  speech  were  large  enough  to  derive  word 
frequencies,  then  we  could  choose  a  much  better  list  of  40K  words  with  a  lower  OOV  rate. 

Overall,  the  41K  trigiam  reduces  the  word  error  by  237r  over  the  20K  standard  trigram  on 
the  November  '93  CSR  S9  evaluation  test.  We  estimate  that  more  than  half  of  this  gain 
was  due  to  the  decreased  percentage  of  OOV  words,  and  the  remainder  was  due  to  the 
increased  language  model  training,  including  specific  examples  of  spontaneous  dictation. 


5.8  Conclusions 


We  found  the  following  interesting  results: 


•  Expanding  the  vocabulary  with  less  frequent  words  does  not  substantially  increase 
the  word  error  on  those  words  already  in  the  vocabulary,  but  does  eliminate  many 
errors  due  to  OOV  words. 

■  Doubling  the  amount  of  language  model  training  text  improves  the  language  model, 
even  though  the  text  comes  from  different  years  than  the  test,  and  even  though  the 
text  was  not  preproccssed  into  proper  lexical  forms. 

•  It  is  possible  to  improve  the  quality  of  the  language  modeling  text  by  modeling  the 
differences  between  the  predicted  reading  style  and  some  examples  of  actual 
transcriptions, 

t  Increasing  the  vocabulary  size  and  language  training  liad  a  bigger  effect  on 
spontaneous  speech  than  it  did  for  read  speech. 


Chapter  6 


HUM  -  Hidden  Understanding  Model 


6.1  Introduction 


ulii  this  chapicr,  we  describe  and  evalualc  hidden  underslanding  models,  a  statislicul 
learning  approach  to  natural  language  understanding.  Given  a  string  of  words,  hidden 
understanding  models  determine  the  most  likely  meaning  for  the  siring.  We  discuss  (1)  the 
problem  of  representing  meaning  in  this  framework,  (2)  the  structure  of  the  statistical 
model,  (3)  the  process  o,'  training  the  model,  and  (4)  the  process  of  underslanding  using  the 
model.  Finally,  we  give  experimental  results,  including  results  on  an  ARPA  evaluation. 

Hidden  understanding  models  are  an  innovative  class  of  statistical  mechanisms  that,  given 
a  string  of  words,  determines  the  most  likely  meaning  for  the  siring.  The  overall  approach 
represents  a  substantial  departure  from  traditional  techniques  by  replacing  hand-craflcd 
grammars  and  rules  with  statistical  models  that  arc  automatically  learned  from  examples. 
Hidden  understanding  models  were  primarily  motivated  by  techniques  that  have  been 
extremely  successful  in  speech  recognition,  c.spcciallv  hidden  Markov  models  [181.  Related 
techniques  have  previously  been  applied  to  the  problem  of  identifying  concept  sequences 
within  a  sentence  [52].  In  addition,  the  approach  contains  elements  of  other  natural 
language  processing  techniques  including  semantic  grammars  [76,  32],  augmented 
transition  networks  (ATNs)  [78],  probabilistic  parsing  [29,  27,  59],  and  automatic  grammar 
induction  [51], 

Hidden  understanding  models  are  capable  of  learning  a  variety  of  meaning  representations, 
ranging  from  simple  domain-specific  representations,  to  ones  at  a  level  of  detail  and 
sophistication  comparable  to  current  natural  language  systems.  In  fact,  a  hidden 
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uiulcrst.uiiiing  iiuxlol  cun  be  used  to  produce  a  rep  cniation  will)  essciilially  tlie  same 
information  content  as  the  semantic  graph  used  b\  iic  Delphi  system  122],  a  general 
purpose  NLP  system,  which  utilizes  a  modified  Deiimlc  Clause  Grammar  formalism.  This 
fact  made  it  possible  to  interface  a  hidden  under.slamling  system  to  the  discourse  processing 
and  data-  ba.se  retrieval  components  of  Delphi  to  moducc  a  complete  end  to  end  system. 
This  hybrid  system  participated  in  the  IffOS  ATIS  natural  language  evalualion.  Although 
only  four  months  old,  the  scores  achieved  by  the  comhinetl  sysic-m  were  quite  res|)eclable. 

Because  of  differences  between  language  understanding  and  speech  recognition,  significant 
changes  are  required  in  the  hidden  Markov  model  f.cthodology.  Unlike  speech,  where 
each  |>honeme  results  in  a  local  sequence  of  specii.i.  the  relation  betsveen  the  meaning  of  a 
sentence  and  the  sequence  of  words  is  not  a  simpf'  linear  sequential  model.  Language  is 
itihcreiitly  nested,  with  subgroups  of  concepts  within  other  concepts. 

A  statistical  system  for  understanding  language  imiM  take  this  and  other  differences  into 
account  in  its  overall  design.  In  principle,  we  ha'c  the  following  rei|uiremcnts  for  a  hiiklen 
understanding  system: 

•  A  iiotationul  sy.stem  for  expressing  meanings. 

•  A  statistical  model  that  is  capable  of  representing  meanings  and  the  association 
between  meanings  and  words, 

•  An  automatic  training  program  which,  given  pairs  of  meanings  and  word  sequences, 
can  estimate  the  parameters  of  a  statistical  model. 

•  An-understanding  program  that  can  .search  the  statistical  model  to  (ind  the  most 
likely  meaning  given  a  word  sequence. 


Below,  we  describe  solutions  for  each  of  these  requirements,  and  describe  the  relationship 
of  these  solutions  to  other  work  in  stochastic  grammars  and  probabilistic  ptusing.  Pinally, 
we  will  report  on  initial  exp .trhnents  with  hidden  understanding  models. 


6.2  Expressing  Meanings 


One  of  the  key  requirements  for  a  hidden  understanding  model  is  that  the  meaning 
representation  must  be  both  precise  and  appropriate  for  automatic  learning  techniques, 

Spccilii,.dly,  we  require  a  meaning  'epresentation  that  is: 
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>  training 

meaning 

program 

expressions 
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model 
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^  meaning 

program 

e.xpressions 

I'ijjurc  6,1:  The  Main  Components  of  a  llidden  Umleislanding  System 

•  lixpfcs.sivo,  It  must  be  able  to  express  meanings  over  the  entire  range  of  iitterunees 
lliut  arc  likely  to  (Kcar  in  an  applieation. 

•  Annotatablc.  It  must  be  possible  to  produce  accurate  annotations  for  a  suHicienily 
large  corpus  with  an  acceptable  level  of  human  effort, 

•  Tramable,  It  must  be  possible  to  estimate  the  model  parameters  from  a  reasonable 
nuiiibcr  of  training  examples, 

•  Tractable,  There  must  be  a  computationally  tractable  algorithm  capable  of  searching 
the  meaning  space. 


In  order  to  facilitate  annotation  of  a  training  corpus,  meaning  expressions  should  be  as 
simple  us  possible.  Frame  based  representations,  such  as  the  example  shown  in  Figure  6,2, 
have  the  advantage  that  they  arc  relatively  simple  .o  understand,  A  difficulty  with  this  style 
of  representation  is  that  the  frames  do  not  align  directly  to  the  words  of  the  sentences.  In 
particular,  a  meaning  frame  contains  few  explicit  clues  as  to  how  the  words  of  a  sentence 
imply  the  structural  characteristics  of  the  frame.  Tree  structured  meaning  representations, 
discussed  in  the  next  section,  have  the  advantage  that  they  can  be  fully  aligned  to  the 
words  of  a  sentence.  The  co.st  is  that  these  tree  structured  representations  are  more  detailed 
than  their  frame  based  counterparts,  thereby  requiring  greater  annotation  effort. 
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SHOW; 

FLIGHl'S: 

TIME: 

PART-OF-DAY:  moniing 
ORIGIN: 

CITY:  Boston 
DEST: 

CITY;  San  Francisco 
DATE: 

DAY-OF -WEEK:  Tuesday 

Please  show  me  morning  flights  from  Boston  to  San 
Francisco  on  Tuesday. 


I-iguii;  fi.2;  A  I'raiiic  Based  Meaning  Rcprescnlalion 


Boilunalely,  (he  techniques  developed  lor  lice  structured  representations  cun  be  extended  to 
simpler  frame  representations  as  well. 


6.2.1  Tree  Structured  Meaning  Representations 

The  central  characteristic  of  a  tree  structured  representation  is  that  individual  concepts 
appear  as  nodes  in  a  tree,  with  component  concepts  appearing  as  nodes  attached  directly 
below  them.  For  example,  the  concept  of  a  flinht  in  the  AXIS  domain  has  component 
concepts  including  airline,  flight  number,  origin,  and  dextination.  These  could  then  form 
part  of  the.  representation  for  the  phrtise:  United  flight  203  from  Dalla.';  to  Atlanta.  The  use 
of  a  hierarchical  representation  is  one  characteristic  that  distinguishes  hidden  understanding 
models  from  earlier  work  in  which  meaning  is  represented  by  a  linear  sequence  of 
concepts  152]. 
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rigurc  6.3:  A  Tieo  Simcuircd  Moaning  Reprcsemuiion 


A  requiicnicni  for  irco  siiucuiicd  rcprcseniaiion.s  is  that  the  order  of  ihc  coniponenl 
concepts  must  match  the  order  of  the  words  they  eorrespond  to.  Titus,  the  representation  of 
the  phrase  flight  203  to  Atlanta  from  Dallas  on  United  includes  the  same  nodes  as  the 
earlier  example,  but  in  a  different  order.  For  both  examples,  ho'cever,  the  interpretation  i.s 
identical. 

At  the  leaves  of  a  meaning  tree  are  the  words  of  the  sentence.  We  distinguish  between 
nodes  that  appear  above  other  nodes,  and  those  that  appear  directly  above  the  words. 

These  will  be  referred  to  as  nonterminal  nodes  and  terminal  nodes  respectively,  forming 
two  disjoint  sets.  No  node  has  both  words  and  other  nodes  appearing  directly  below  it. 

Figure  6.3  shows  an  example  of  a  typical  meaning  tree.  In  this  example,  ihe  flight  node 
represents  the  abstract  concept  of  a  flight,  which  is  a  structured  entity  that  may  contain  an 
origin,  a  destination,  and  other  component  concepts.  Appearing  directly  above  the  word 
"flight”  is  a  terminal  node,  which  we  call  a  flight  indicator.  This  name  is  chosen  to 
distinguish  it  from  the  flight  node,  and  also  because  the  word  flight,  in  some  sense, 
indicates  the  presence  of  a  flight  concept.  Similarly,  there  are  airline  indicators,  origin 
indicators,  and  destination  indicators. 
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Figure  6.4:  A  S(xcializeci  Sublanguage  Analysis  and  a  Full  Symaclic  Analysis 

One  view  of  these  tree  structured  representations  is  that  they  are  parse  trees  produced 
according  to  a  semantic  grammar.  In  this  view,  the  dominance  relations  of  the  grammar 
are  predetermined  by  the  annotation  schema,  while  the  precedence  relations  are  learned 
from  the  training  examples. 


6.2.2  Atternative  Tree  Representations 


Tree  structured  meaning  expressions  can  range  in  complexity  from  simple  special  purpose 
sublanguage  representations  to  the  structural  equivalent  of  detailed  syntactic  parse  trees. 
The  possibilities  are  limited  only  by  two  fundamental  requirements:  (1)  semantic  concepts 
must  be  hierarchically  nested  within  a  tree  structure,  and  (2)  the  sets  of  terminal  and 
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iKiiucniuiKil  nodes  imisi  reiiuiin  disjoini.  lioih  ol  -sc  rci.iiiironioin.s  can  he  salislicd  by 
iiccs  possessing  most  of  the  slriiciurai  cliaractcn^...  ^  of  coiueniional  syiilaciic  parse  trees. 
Since  ourobjcciivo  is  to  model  meaning,  the  node-  must  still  be  labeled  to  relleet  semantic 
categories.  However,  additional  and  augmented  labels  may  be  introduced  to  relleet 
sytitactic  categories  as  well. 

Representations  of  this  form  contain  significantly  more  internal  structure  than  specialized 
sublanguage  models.  This  can  be  seen  in  the  e.sa,n['le  in  Figure  6.4.  The  specialized 
sublanguage  represcnialion  requires  only  seven  nod.cs,  while  a  full  syntactically  motivated 
analysis  requires  lil'teen.  The  additional  nodes  arc  iscd  to  distinguish  what  is  being  shown 
to  whom,  to  relleet  the  fact  that  the  stopover  phru'C  is  part  of  .i  relaiisc  clause,  and  to 
determine  the  internal  siructuie  of  the  relative  claiiso. 

One  interesting  characieristic  of  these  more  clabor.iie  trees  is  their  similarity  to  those 
produced  by  classical,  linguistically  motivated,  natural  language  systems.  Thus,  a  hidden 
understanding  model  can  serve  to  replace  the  pari-of-  speech  tagger,  parser,  and  semantic 
interpreter  of  a  classical  system.  Instead  of  writing  grammar  and  .semantic  interpretation 
rules  by  hand,  the  training  program  automatically  cmistructs  a  statistical  model  from 
examples  of  meaning  trees. 

Regardless  of  the  details  of  the  tree  structure  and  labels,  the  components  comprisitig  a 
hidden  understanding  .system  reiiuiin  unchanged.  The  only  difference  is  in  how  the  system 
is  trained. 


6.2.3  Frame  Based  Representations 

One  way  to  think  of  a  frame  based  meaning  is  as  a  partially  specilied  tree  in  which  some 
words  arc  not  accounted  for.  Nevertheless,  a  frame  representation  is  a  complete  meaning 
representation  in  the  sense  that  it  fully  specifies  the  concepts  and  structure  comprising  the 
meaning.  In  terms  of  a  tree  structured  representation,  the  set  of  nonterminal  nodes  is  fully 
specified,  while  some  of  the  terminal  nodes  may  be  omitted. 

The  missing  terminal  nodes  are  said  to  be  hidden,  in  the  sense  that  every  word  is  required 
to  align  to  some  terminal  node,  but  the  alignment  is  not  nccessariW  given  by  the  meaning 
frame.  These  hidden  nodes  must  later  be  aligned  as  part  of  the  training  process.  The 
general  idea  is  to  assign  a  small  number  of  free  terminal  nodes  (typically  one  or  two) 
beneath  every  nonterminal  node.  These  are  then  free  to  align  to  any  unassigned  words, 
provided  that  the  overall  tree  structure  is  not  violated.  An  EM  algorithm  (Estimate- 
Maximize)  is  used  to  organize  the  unassigned  terminal  nodes  into  classes  that  correspond 
to  individual  words  and  phrases,  and  that  bind  to  particular  abstract  concepts.  Figure  6.5 
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f'igurc  6.5:  A  Tree  Sliiicture  Coircspondiiig  to  a  ITainc  Repre.senlatioii 

shows  (he  complete  meaning  tree  with  hidden  nodes  corresponding  to  the  frame  in  Figure 

6.2, 

If  We  consider  tree  structured  meaning  expressions  as  parse  trees  which  arc  generated 
according  to  some  incompletely  speciiied  graiiiniar,  liien  the  problem  of  aligning  the 
hidden  nodes  can  be  considered  as  a  grammar  induction  problem.  In  this  way,  the  problem 
of  aligning  the  hidden  nodes  given  only  a  partially  specified  .set  of  trees  is  analogous  to  the 
problem  of  fully  parsing  a  training  coipus  given  only  a  partial  bracketing.  The  difference  is 
that  while  a  partial  bracketing  determines  constituent  boundaries  that  cannot  be  crossed,  a 
partially  specified  tree  determines  .structure  that  must  be  preserved. 
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6.3  I'he  Statistical  Model 


Om;  central  characienslic  of  hidden  nnderstandiiig  nuidels  is  that  iliey  are  i>cncnttivc.  From 
this  viewpoint,  language  is  produced  by  a  two  component  statistical  process.  The  iirsi 
component  chooses  the  meaning  to  be  expressed,  elYectivcly  deciding  what  to  say.  The 
'second  component  selects  word  sequences  to  express  that  meaning,  elTectively  deciding 
how  to  say  it.  The  lirst  phase  is  referred  to  as  the  :.i  nuiiilic  Itinuiuinc  nuhlel.  and  can  be 
thought  of  as  a  stochastic  process  that  produces  me.iiiing  c.spressions  selected  from  a 
'.inivorsc  of  meanings.  The  second  phase  is  referred  :o  as  the  hwlcal  n’a\i:iiiion  niodi’l.  and 
can  be  thought  of  as  a  stochastic  process  th.it  genei..'es  woriK  once  a  metining  is  given. 

By  analogy  v,  ith  hidden  Markov  models,  \se  refer  a-  the  combmaiion  of  these  iwo  models 
as  a  hidden  understtinding  model,  The  word  hidden  refers  to  the  fact  that  only  words  can 
be  observed.  The  internal  stales  of  each  of  the  two  models  are  unseen  and  must  be  inferred 
from  the  words.  The  problem  of  language  understanding,  then,  is  to  recover  the  most  likely 
meaning  structure  given  a  seqtience  of  words.  More  formally,  understanding  ,<  word 
sequence  VV  is  accomplished  by  searching  among  all  possible  meanings  for  some  meaning 
M  such  that  P(M\W)  is  maximi/ed.  By  Bayes  Rule.  P{M\W)  can  be  tewrilten  as; 


1\W\M)P(M) 
Ft  HO 


Now,  since  P(W)  docs  not  depend  on  M,  maximi/.ing  P(M\W)  is  equivalent  to 
maximizirig  the  product  P{W\M)  P(M).  However,  P(M  'T)  is  simply  our  lexical 
realization  model,  and  P(M)  is  simply  our  .semantic  language  model.  Thus,  by  searching  a 
combination  of  these  models  it  is  possible  to  find  the  maximum  likelihood  meaning  M 
given  word  sequence  W.  Considering  the  statistical  model  as  a  stochastic  grammar,  the 
problem  of  determining  M  given  W  is  analogous  to  the  problem  of  finding  the  most  likely 
derivation  for  W  according  to  that  grammar. 


6.3.1  Semantic  Language  Model 

For  tree  structured  meaning  representations  individual  nonterminal  nodes  determine 
particular  abstract  .semantic  concepts.  In  the  semantic  language  model,  each  abstract 
concept  corresponds  to  a  pmhubilixtic  slate  transiiUm  network.  All  such  networks  are  then 
combined  into  a  single  probabilistic  recursive  transition  network,  forming  the  entire 
semantic  language  model. 
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Tlio  iKMwoik  i.-iiirospomJing  lo  a  p.iriicul.u  .ilisinii  '  Mtccpi  Ldiivisis  of  siaics  for  cacti  of  ils 
conipoiicni  coiiccpis.  logcihcr  wiili  iu('  cxira  stalC'  :iai  clolinc  ilic  ciury  and  exii  poiiiis. 
Every  coniponcni  eoncept  is  fully  connected  loeviw  olher  coiiiponeiil  concept,  with 
additional  paths  leading  froni  the  entry  state  to  each  component  concept,  and  from  each 
component  concept  to  the  exit  state.  Figure  6.6  shous  a  sample  network  corresponding  lo 
.the  jli^ht  concept.  Of  course,  there  arc  many  more  ilight  component  concepts  in  the  AXIS 
domain  than  actually  appear  in  this  cxatiiplc.  Associated  with  each  arc  is  a  probability 
value,  in  a  similar  fashion  to  the  TINA  system  I.*!*''  These  probabilities  have  the  form 
P{Slaien\St<i(cn  ■  I,  ,Conlcxl),  which  is  the  po  '.iliiliiy  of  a  taking  transition  from  one 
stale  to  another  wjdiiii  a  pailiculu.-  context.  Thus.  arc  from  oriiii/i  to  r/i'\r  has 
probability  origin,  flighl).  meaning  the  piohability  of  entering  (/e.vr  from  ii/'igi/i 

within  the  context  of  the //ig/i/  network.  Prcsuinai'l'. .  this  piobability  is  relatively  high, 
since  people  usually  mention  the  de.siination  of  a  'Imhi  directly  after  mentioning  its  origin. 
Conversely,  P(origin\dc.ii,  flight)  is  probably  I  >\v  because  people  don't  usually  express 
concepts  in  that  order,  Thus,  while  all  paths  ihrtuigh  the  slate  space  are  possible,  some 
have  niueli  .uglier  probabilities  than  others. 

Within  a  concept  network,  component  concept  state'  exist  for  both  nonterminal  concepts, 
such  as  origin,  as  well  as  terminal  concepts,  such  ns  jliglii  imlkotor.  Arrows  pointing  into 
iiontcriiiiiial  states  indicate  entries  into  other  networks,  while  arrows  pointing  away  indicate 
exits  out  of  those  networks.  Terminal  states  correspond  to  networks  as  wcl'.,  although  these 
are  determined  by  the  lexical  rcali/.aiion  model  and  have  a  different  internal  structure. 

Thus,  every  nieaning  tree  directly  corresponds  directly  to  some  particular  path  through  the 
slate  space.  Figure  6.7  shows  a  meaning  tree  ami  its  corresponding  path  through  stale 
space. 

Viewed  as  a  granimat,  the  semantic  language  model  is  expressed  directly  as  a  collection  of 
iielworks  rather  than  as  a  collection  of  production  rules.  Thc.se  networks  represent 
grammatical  constraints  in  a  somewhat  different  fashion  than  do  graiimiars  based  on 
production  rules.  In  this  model,  constituents  may  appear  beneath  nonterminal  nodes  in  any 
arbitrary  order,  while  preferences  for  some  ordering.s  are  determined  through  the  use  of 
probabilities.  By  contrast,  most  grammars  limit  the  ordering  of  constituents  to  an  explicit 
set  which  is  specified  by  the  grammar  rules.  The  approach  taken  in  the  TINA  system 
eliminates  many  ordering  constraints  while  retaining  the  local  state  transition  constraints 
determined  by  ils  grammar.  We  believe  that  an  unconstrained  ordering  of  constraints 
increases  parsing  robustness,  while  the  preferences  determined  by  the  arc  probabilities  help 
minimize  overgeneration. 


CHAPTER  6. 


ciUcr  • 

► 


HUM  -  HIDDEN  UNDERSTANDI '  ' ,  MODEL 


73 


.  .  I* 


4  CMt 


Figure  6.6;  A  Partial  Network  Corre.spuniling  to  the  ATIS  Flight  Concept 
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6.3.2  I.cxical  Realization  .Model 

Jus(  as  noiiiiimiinal  tree  nodes  correspond  to  nclwoiks  in  the  seiiiuntic  language  model, 
terminal  nodes  correspond  to  tieiworks  in  the  lexic.il  realization  model.  The  din'erence  is 
rtiat  .semantic  language  networks  specify  transition  probabilities  between  states,  whiie 
•Ic.sical  realization  networks  specify  transition  probabilities  between  words.  Lexical 
realization  probabilities  have  the  form  P(uro>‘dn|u'c)i'cin  -  \,  context),  which  is  the 
probability  of  taking  a  transition  from  one  word  to  mother  given  a  particular  context. 

Thus,  PiAkQw\plcase,show  -  indicator)  is  the  pfisability  tluit  the  word  show  follows  the 
word  pU'iisi'  within  the  contc.xt  of  a  show  iiuliaitoi  -hrase.  In  addition,  there  are  two 
pseudo-words.  +hegin*  and  ’'end*,  which  indicaie  the  beginning  and  ending  of  phrases, 
I'hus,  we  have  probabilities  such  :is  Ptpleasc]  *  bi  mn*,  show-indicator),  which  is  the 
probability  that  phuisc  is  the  lirsi  word  of  a  .slum  uulivaior  phrase,  and 
P{*cnd  •  -Tne.show  indicator)  ,  which  is  the  probability  of  exiting  ti  show  iiidicattri' 
phra.se  given  that  the  previous  word  was  me. 


6.4  The  Understanding  Component 

\s  we  have  seen,  undcrsttmding  a  word  string  W  requires  linding  a  meaning  M  such  that 
the  probability  F(W\M)P(M)  is  maximized.  Since,  the  semantic  language  model  and  the 
lexical  realization  model  arc  both  probabilistic  networks,  Pi,W\M)P{M)  is  the  probability 
of  a  particular  path  through  the  combined  network.  Thus,  the  problem  of  understanding  is 
to  find  ihcehighcst  probability  path  among  all  possible  paths,  where  the  probability  of  a 
path  is  the  product  of  all  the  transition  probabilities  along  .hat  path. 


P(3fafe„|jfctfc„..|,confexf)  if  fin  Semantic  Language  Model 
P(.wordn\ward„..\, context)  if  fin  Lexical  Realization  Model 

(6.1) 

Thus  far,  we  have  discus.scd  the  need  to  search  among  all  meaning'  for  one  with  a 
maximal  probability.  In  fact,  if  It  were  necessary  to  search  every  path  through  the 
combined  network  individually,  the  algorithm  would  require  exponential  time  with  respect 
to  sentence  length.  Fortunately,  this  can  be  drastically  reduced  by  combining  the 
probability  computation  of  common  subpaths  through  dynamic  programming.  In  particular, 
because  our  meaning  representation  aligns  to  the  word.s,  the  search  can  be  efficiently 
performed  using  the  well-known  Viterbi  [74]  algorithm. 
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Figure  6,7:  A  Meaning  Trc(  \nd  its  Corresponding  Path  Through  State  Space 
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Siiict;  lUii'  uiuk'iiying  nioclcl  !>%  a  recursive  iransiiu’-  nclwork,  ihe  stales  fur  the  Vilerbi 
seareli  must  be  allocated  dynaitiically  as  llio  search  I'roceeds.  lii  addiiioii,  it  is  necessary  to 
prune  very  low  probability  paths  in  order  to  keep  the  coiriputaiiori  tractable,  We  have 
developed  an  elegant  algorithm  that  integrates  state  allocation.  Viterbi  .search,  and  piuning 
all  within  a  single  traversal  of  a  tree-  like  data  structure.  In  this  algorithm,  each  of  the  set 
.of  carreiuly  active  states  is  repre.sentcd  as  a  node  in  a  tree.  New  nodes  are  added  to  the 
tree  as  the  computation  pushes  into  new  subnetworks  that  are  not  currently  active.  .Stored 
at  each  node  is  the  probability  of  the  most  likely  p.ith  reaching  that  state,  together  with  a 
hackpointer  suflicient  to  recreate  the  path  I.iter  if  needed,  Whencs  er  the  probability  of  all 
states  in  a  subtree  falls  below  the  threshold  six'cilie  l  by  the  betini  witlih,  the  entire  subtree 
IS  pruned  away. 


6.5  The  Training  Compo”ent 


In  order  to  train  the  statistical  model,  we  must  estimate  transition  probttbililies  fur  the 
semantic  language  model  and  le.xical  reali/.ation  model.  In  the  case  of  fully  specilied 
meaning  trees,  each  meaning  tree  can  be  straiiilitforwardly  converted  into  a  path  through 
state  space.  Then,  by  counting  occurrence  and  transition  frequencies  along  those  paths,  it  is 
possible  to  form  simple  estimates  of  the  transition  probabilities.  Let  Cisiatem,  contexts) 
denote  tlie  number  of  times  statem  has  occurred  in  contexts,  and  let 
C7(jiaie7i|jtatem,coTifexfa)  denote  the  number  of  times  that  this  condition  has  led  to  a 
transition  to  state  staten.  Siniilaiiy,  deline  counts  C{wordrtL,conli;xtt)  and 
C(wordn\wordtn,ccniicxil).  Then,  a  direct  estimate  of  the  probabilities  is  given  by: 


P(staten\state,„,  context)  - 


C(state,^\slate^,  context) 
C{state,T,,ccmtexi) 


and 


P  {w  or  dn\w  or  dm,  context) 


Ciwordnlmor  dm,  context) 
C(wor  dm,  context) 


In  order  to  obtain  robust  estimates,  these  simple  estimates  arc  smoothed  with  backed-off 
estimates  [31],  using  techniques  similar  to  those  used  in  speech  recognition  [38,  55).  Thus, 
P*afatcn|a<afeTn,cwtest)  is  smoothed  with  P^ 3taten\context),  and 
P^wordn\wordm,context)  is  smoothed  with  F'wordnlcontext).  Robustness  is  further 
increased  through  word  classes.  For  example,  Boston  and  San  Francisco  are  both  members 
of  the  class  of  cities. 


In  the  ca.se  of  frame  based  representations,  it  is  not  always  possible  to  construct  an  exact 
path  through  the  state  space  corresponding  to  a  meaning  representation.  Nevertheless,  since 
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iViimcM  iiic  iiLMU-d  as  partially  specilicd  licos,  mo-.'  '  the  paili  c.in  ho  lecoiisiruclod,  with 

some  portions  tinJctomiiiiod.  Then,  the  partial  path  an  he  used  to  constrain  a  gradient 
descent  search,  called  the  forward-  backward  algunthni  (I8|  for  estimating  the  model 
parameters.  This  algorithm  is  an  iterative  procedure  for  adjusting  the  model  parameters  so 
as  to  increa.se  the  likelihood  of  generating  the  training  data,  and  is  an  instance  of  the 
well-known  class  called  EM  (Estimatc-Maximi/e)  algorithms. 


6.6  Experimental  Results 


We  have  implemented  a  hidden  undorsiaiuling  system  and  performed  a  variety  of 
experiinenis.  In  addition,  we  participated  in  the  AREA  A'l'lS  NL  evaluation.  One 
experiment  invohed  a  1000  sentence  AXIS  corpus,  .innoiated  according  to  a  simple 
specialized  sublanguage  tnoiiel.  The  amuitation  effoit  was  split  between  two  atinoialors, 
one  of  whotii  was  a  sysietn  developer,  while  the  other  was  not.  To  annotate  the  training 
data,  we  used  a  bootstrappitig  process  in  which  onlv  the  lirsl  100  sentences  were  annotated 
strictly  by  hand. 

Thereafter,  we  worked  in  cycles  of: 


1.  Running  the  training  program  using  alt  available  annotated  data. 

2.  Running  the  understanding  component  to  annotate  new  sentences. 

3.  Hand  cc'  '"-ling  the  new  annotations. 

Annotating  in  this  way,  we  found  that  a  single  unnotalor  could  produce  200  si'iiicnces  per 
day.  Wc  then  extracted  the  first  100  sentences  as  a  lest  set,  and  trained  the  system  on  the 
remaining  900  sentences.  The  results  were  as  follows: 

•  61%  matched  exactly, 

•  21%  had  conect  meanings,  but  did  not  match  exactly. 

•  28%  had  the  wrong  meaning. 

Another  experiment  involved  a  6000  sentence  AXIS  corpus,  annotated  according  to  a  more 
sophisticated  meaning  model.  In  this  experiment,  the  Delphi  system  automatically 
produced  the  annotation  by  printing  out  its  own  internal  representation  for  each  sentence, 
converted  into  a  more  readable  form.  In  order  to  maintain  high  quality  annotations,  we 
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Liscil  mils  \onli.'in.-CN  lor  winch  Delphi  producoil  a  C"inplcie  parse,  and  foi  which  ii  also 
I'lMricNOsl  .1  eoircci  answer  froni  ihe  daiabase.  We  ihen  reiiKivcd  3(X)  sentences  as  a  lest  set, 
and  trained  the  system  on  the  reniiuning  57(M).  The  results  were  as  follows: 

■  857f  matched  exactly. 

•  87r  had  cr'irect  meanings,  but  did  not  match  exactly. 

•  T<  had  the  wrong  meaning. 

I'or  Ihe  ARP, A  esahiation,  we  coupleil  oiir  hidden  understanding  system  to  ihe  discourse 
and  backend  components  of  the  IXdphi.  1,'sing  the  entire  hOOO  sentence  corpus  described 
above  as  training  slala,  the  system  produced  a  score  of  26'/!  simple  error  on  the  A'fl.S  Nl, 
evahiation.  By  exaniiniiig  inc  errors,  we  have  rcaehed  the  conclusion  that  nciirly  half  are 
due  to  simple  prograinming  issues,  especially  in  the  interface  between  Delphi  and  the 
hidden  understanding  system.  In  fact,  the  interface  wtis  still  incomplete  at  the  time  of  the 
evaluation. 

Wc  have  just  begun  a  series  of  experiments  using  frame  based  annotations,  and  are 
continuing  to  reline  our  lechniques.  bt  a  preliminary  test  invoi.’ing  a  small  corpus  of  588 
ATIS  sentences,  the  system  correctly  aligned  the  bidden  stales  for  over  957u  of  the 
sentences  in  the  corpus. 


6.7  Limitations 


Several  limitations  to  our  current  approach  arc  worth  noting.  In  a  small  number  of  cases, 
lingui.>tic  movement  phenomena  make  it  difficult  to  align  the  words  of  a  sentence  to  any 
■  tree  structured  meaning  expression  without  introducing  crossings,  in  most  cases,  we  have 
been  able  to  work  around  this  problem  by  introducing  minor  changes  in  our  annotation 
such  that  the  tree  structure  is  maintained.  A  second  limitation,  due  to  the  local  nature  of 
the  model,  is  an  inability  to  handle  nonlocal  phenomena  such  as  conference.  Finally,  in 
some  cases  the  meaning  of  a  sentence  depends  strongly  upon  the  discourse  state,  which  is 
beyond  the  scope  of  the  current  model. 
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6.8  Conclusions 


Wc  have  dcmonsiraiod  the  iKissihility  of  auiinnaiieally  learnin;’  seinanlic  repiesenlaiioiis 
tlireelly  front  u  irainiiig  eorpiis  through  the  applicatmti  of  siaiisiieal  leehnit|ties.  Empirical 
results,  including  the  results  of  an  AREA  evaluation,  indicate  that  these  technit|ues  are 
capable  of  relatively  high  levels  of  perfornianco. 

While  hiilden  understanding  models  are  based  prim.utly  on  the  concepts  of  hidtlen  Markov 
models,  we  hase  also  shown  their  relatiotiship  to  o'  ;er  work  in  stochastic  giammars  atid 
prohahtlisttc  parsing. 

I'itially,  we  hate  tutted  some  limttttitotts  to  our  cun.  tit  approtich.  We  view  etich  of  these 
litnilatiotis  as  opiioriuttities  for  further  reseatch  and  esploraiton. 
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